While dense retrievers, powered by sophisticated embedding models, are adept at understanding semantic relationships and user intent, they can sometimes miss the mark on queries demanding exact lexical matches. For instance, a model might grasp the general topic of a query but fail to prioritize documents containing a very specific, yet less common, product code or technical term mentioned by the user. Conversely, traditional sparse retrievers like BM25 excel at finding these exact terms but lack the broader contextual understanding to retrieve semantically similar but lexically different content. They operate on keyword matching, which, while precise, can be brittle when faced with synonyms, paraphrases, or complex natural language queries.
Hybrid search offers a pragmatic and often highly effective solution by combining the strengths of both dense and sparse retrieval techniques. The objective is to create a retrieval pipeline that is more comprehensive and reliable, capturing both the semantic depth offered by embeddings and the keyword precision of sparse methods. This synergistic approach helps ensure that the context provided to the generator model is as relevant and complete as possible.
Combining dense and sparse retrievers brings several advantages to your RAG system: better recall on queries that mix semantic intent with exact terms, more robust handling of rare identifiers and domain-specific jargon, and a more relevant, complete context for the generator model.
Consider a scenario in a financial RAG application. A user might ask, "What are the implications of regulation XYZ on Q3 earnings for tech companies?" A dense retriever could find documents discussing earnings and tech companies generally. A sparse retriever would ensure documents explicitly mentioning "regulation XYZ" are surfaced. A hybrid approach would ideally prioritize documents that satisfy both aspects of the query.
The most prevalent method for implementing hybrid search is score fusion (also known as late fusion). In this setup, both the dense and sparse retrievers process the input query independently. Each produces a list of candidate documents along with their respective relevance scores. The core challenge then lies in intelligently merging these two sets of results into a single, re-ranked list.
A common architecture for hybrid search involves parallel retrieval followed by a fusion step to combine and re-rank results.
Several techniques exist for fusing these scores:
Weighted Sum: This is a straightforward approach where the final score for a document $d$ is a weighted combination of its normalized scores from the dense retriever, $S_{dense}(d)$, and the sparse retriever, $S_{sparse}(d)$:

$$S_{hybrid}(d) = w_{dense} \cdot \text{norm}(S_{dense}(d)) + w_{sparse} \cdot \text{norm}(S_{sparse}(d))$$

Here, $w_{dense}$ and $w_{sparse}$ are weights that sum to 1 (e.g., $w_{dense}=0.6$, $w_{sparse}=0.4$), reflecting the perceived importance of each retriever. The function $\text{norm}(\cdot)$ represents score normalization, a significant step discussed later. The choice of weights is often empirical and may require tuning based on your specific dataset and query patterns.
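A minimal sketch of weighted-sum fusion, assuming each retriever returns a dictionary mapping document IDs to raw scores; the weights and the min-max normalization used here are illustrative choices, not the only options:

```python
def min_max_norm(scores):
    """Scale a dict of {doc_id: score} into [0, 1]; constant scores map to 0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    return {d: (s - lo) / span if span else 0.0 for d, s in scores.items()}

def weighted_sum_fusion(dense_scores, sparse_scores, w_dense=0.6, w_sparse=0.4):
    """Fuse normalized dense and sparse scores with a weighted sum."""
    dense_n = min_max_norm(dense_scores)
    sparse_n = min_max_norm(sparse_scores)
    all_docs = set(dense_n) | set(sparse_n)
    fused = {
        d: w_dense * dense_n.get(d, 0.0) + w_sparse * sparse_n.get(d, 0.0)
        for d in all_docs
    }
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

# Example with hypothetical scores from each retriever.
dense = {"doc_a": 0.82, "doc_b": 0.75, "doc_c": 0.40}
sparse = {"doc_b": 14.2, "doc_d": 11.7}
print(weighted_sum_fusion(dense, sparse))
```

Documents seen by only one retriever simply receive a zero contribution from the other, so they can still surface if their single score is strong enough.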
Reciprocal Rank Fusion (RRF): RRF offers an elegant way to combine rankings without needing to worry too much about the absolute score values or their distributions, which can vary wildly between different retrieval systems. For each document d, its RRF score is calculated by summing the reciprocal of its rank in each retriever's result list.
$$\text{RRFScore}(d) = \sum_{i \in \text{Retrievers}} \frac{1}{k + \text{rank}_i(d)}$$

If a document is not found by a retriever, its rank for that retriever can be considered infinite (or practically, a very large number), making its contribution to the sum negligible. The constant $k$ (commonly 60) helps to down-weight the influence of documents that are ranked very highly by only one retriever but poorly by others. Documents consistently ranked well across multiple systems receive higher RRF scores.
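Because RRF needs only rank positions, not raw scores, it is simple to implement. A minimal sketch, assuming each retriever returns an ordered list of document IDs (most relevant first); the example rankings are illustrative:

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of doc IDs with Reciprocal Rank Fusion.

    ranked_lists: iterable of lists, each ordered from most to least relevant.
    """
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Documents missing from a list simply contribute nothing for that retriever.
dense_ranking = ["doc_a", "doc_b", "doc_c"]
sparse_ranking = ["doc_b", "doc_d", "doc_a"]
print(reciprocal_rank_fusion([dense_ranking, sparse_ranking]))
```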
Two-Stage Retrieval (Cascade): Another approach involves a cascaded, or two-stage, process: a fast first-stage retriever (often the sparse one) pulls a broad candidate set from the full corpus, and a second stage (a dense model or re-ranker) re-scores only those candidates, as sketched below.
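A compact sketch of the cascade idea under that assumption; `sparse_retriever` and `dense_scorer` are hypothetical callables standing in for whatever retrieval components you actually use, and the candidate counts are illustrative:

```python
def cascade_retrieve(query, sparse_retriever, dense_scorer,
                     first_stage_k=100, final_k=10):
    """Two-stage retrieval: broad sparse candidate generation, then dense re-scoring."""
    # Stage 1: fast lexical recall over the full corpus.
    candidates = sparse_retriever(query, top_k=first_stage_k)  # list of doc IDs

    # Stage 2: score only the candidates with the (slower) dense model.
    rescored = [(doc_id, dense_scorer(query, doc_id)) for doc_id in candidates]
    rescored.sort(key=lambda kv: kv[1], reverse=True)
    return rescored[:final_k]
```

The trade-off is that anything missed by the first stage can never be recovered by the second, so the first-stage candidate pool needs to be generous.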
The choice of fusion strategy often depends on the characteristics of your data, the types of queries you expect, and the computational resources available.
Successfully implementing hybrid search requires attention to a few important details:
Sparse retrievers (like BM25) and dense retrievers (cosine similarity of embeddings) produce scores on different scales and with different distributions. BM25 scores can range widely, while cosine similarity is typically bounded between -1 and 1 (or 0 and 1 for positive embeddings). Directly adding these scores in a weighted sum without normalization can lead to one retriever's scores dominating the other's, irrespective of the chosen weights.
Common normalization techniques include min-max scaling, which maps scores into $[0, 1]$ using observed bounds $S_{min}$ and $S_{max}$, and z-score standardization, which centers scores with the mean $\mu$ and scales them by the standard deviation $\sigma$.
The $S_{min}$, $S_{max}$, $\mu$, and $\sigma$ parameters for normalization should ideally be estimated from a representative set of query results for each retriever.
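A short sketch of both normalizations; the optional parameters let you pass statistics precomputed from a representative query sample instead of deriving them from the current result list alone:

```python
def min_max_normalize(scores, s_min=None, s_max=None):
    """Map a list of scores into [0, 1] with min-max scaling."""
    s_min = min(scores) if s_min is None else s_min
    s_max = max(scores) if s_max is None else s_max
    span = s_max - s_min
    return [(s - s_min) / span if span else 0.0 for s in scores]

def z_score_normalize(scores, mu=None, sigma=None):
    """Standardize a list of scores to zero mean and unit variance."""
    mu = sum(scores) / len(scores) if mu is None else mu
    if sigma is None:
        sigma = (sum((s - mu) ** 2 for s in scores) / len(scores)) ** 0.5
    return [(s - mu) / sigma if sigma else 0.0 for s in scores]
```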
If using a weighted sum, determining the optimal values of $w_{dense}$ and $w_{sparse}$ is an important tuning decision.
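One simple, if brute-force, way to choose them is a grid search over $w_{dense}$ on a small labeled validation set, scoring each setting with a metric such as recall@k. The sketch below assumes a hypothetical `fuse_and_rank(query, w_dense)` helper (for example, built from the weighted-sum code above) and a list of (query, relevant document IDs) pairs:

```python
def tune_dense_weight(validation_set, fuse_and_rank, k=10, steps=11):
    """Grid-search w_dense in [0, 1]; w_sparse is implicitly 1 - w_dense."""
    best_w, best_recall = 0.5, -1.0
    for i in range(steps):
        w_dense = i / (steps - 1)
        hits, total = 0, 0
        for query, relevant_ids in validation_set:
            top_k = {doc_id for doc_id, _ in fuse_and_rank(query, w_dense)[:k]}
            hits += len(top_k & set(relevant_ids))
            total += len(relevant_ids)
        recall = hits / total if total else 0.0
        if recall > best_recall:
            best_w, best_recall = w_dense, recall
    return best_w, best_recall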
For most production RAG systems, starting with RRF or a well-tuned static weighted sum provides a strong baseline.
On the sparse side, libraries such as rank_bm25 provide a straightforward BM25 implementation in Python. On the dense side, options range from open-source sentence-transformer models (e.g., all-MiniLM-L6-v2, multi-qa-mpnet-base-dot-v1) to more powerful proprietary models or models you've fine-tuned on your specific domain (as discussed in the section on "Domain-Specific Fine-tuning of Embedding Models").
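To make this concrete, here is a compact sketch that wires rank_bm25 and a sentence-transformers model into the `reciprocal_rank_fusion` helper shown earlier; the corpus and query are illustrative, and both libraries must be installed for this to run:

```python
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

corpus = [
    "Regulation XYZ changes reporting rules for technology firms.",
    "Q3 earnings for major tech companies beat analyst expectations.",
    "A guide to filing quarterly earnings reports.",
]
query = "implications of regulation XYZ on Q3 earnings for tech companies"

# Sparse side: BM25 over whitespace-tokenized text.
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
sparse_scores = bm25.get_scores(query.lower().split())
sparse_ranking = sorted(range(len(corpus)),
                        key=lambda i: sparse_scores[i], reverse=True)

# Dense side: cosine similarity between sentence embeddings.
model = SentenceTransformer("all-MiniLM-L6-v2")
doc_emb = model.encode(corpus, convert_to_tensor=True)
query_emb = model.encode(query, convert_to_tensor=True)
dense_scores = util.cos_sim(query_emb, doc_emb)[0]
dense_ranking = sorted(range(len(corpus)),
                       key=lambda i: float(dense_scores[i]), reverse=True)

# Fuse the two rankings (reciprocal_rank_fusion defined earlier).
print(reciprocal_rank_fusion([dense_ranking, sparse_ranking]))
```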
When evaluating a hybrid search setup, compare its retrieval quality against dense-only and sparse-only baselines on queries that are representative of your actual workload.
While powerful, hybrid search introduces some considerations: you must build and maintain two indexes, running two retrievers and a fusion step adds latency and infrastructure complexity, and the fusion itself brings extra hyperparameters (the fusion weights, the RRF constant $k$, normalization statistics) that need tuning and monitoring.
Despite these challenges, the benefits in retrieval quality often make hybrid search a worthwhile investment for production RAG systems that demand high accuracy and adaptability. By combining the lexical precision of sparse search with the semantic understanding of dense search, you create a more resilient and effective foundation for your generator model.