You've seen how important the quality of retrieved documents is for the overall RAG system. While techniques like domain-specific embedding models and hybrid search broaden the net and improve initial candidate selection, re-ranking acts as a fine-toothed comb, meticulously sifting through these candidates to bring the absolute best to the forefront. This hands-on section will walk you through implementing an advanced re-ranking stage using a cross-encoder model and evaluating its impact.We'll simulate a common scenario: a user asks a question, our initial retrieval (often called a bi-encoder based system) fetches a list of potentially relevant documents, and then a re-ranker (a cross-encoder) re-evaluates these top candidates to produce a more precise final list.Setting Up Our Environment and DataFirst, ensure you have the necessary libraries. We'll primarily use sentence-transformers for both our initial retriever and the re-ranker, as it provides convenient interfaces for various pre-trained models.# Ensure you have these libraries installed: # pip install sentence-transformers torchLet's define a small corpus of documents and a few sample queries with their known relevant documents. In a scenario, this corpus would be much larger, and the ground truth for evaluation would be more extensive.from sentence_transformers import SentenceTransformer, CrossEncoder, util import torch # Sample documents (our knowledge base) documents = [ {"id": "doc1", "text": "Our software supports Windows 10, Windows 11, and macOS Monterey or newer."}, {"id": "doc2", "text": "To install, download the installer from our website and run it. Follow the on-screen prompts."}, {"id": "doc3", "text": "The license key can be found in your purchase confirmation email. Enter it in the 'Activation' window."} {"id": "doc4", "text": "For troubleshooting, please check our online knowledge base or contact support via support@example.com."}, {"id": "doc5", "text": "System requirements include at least 4GB of RAM and 10GB of free disk space. A modern CPU is recommended for optimal performance."}, {"id": "doc6", "text": "Updates are automatically downloaded and installed. You can check for updates manually via the 'Help' menu."} ] doc_texts = [doc['text'] for doc in documents] # Sample queries with ground truth for evaluation queries_with_ground_truth = [ {"query": "How do I install the software?", "relevant_doc_id": "doc2", "relevant_doc_text": documents[1]["text"]}, {"query": "What operating systems are supported?", "relevant_doc_id": "doc1", "relevant_doc_text": documents[0]["text"]}, {"query": "Where is my license key?", "relevant_doc_id": "doc3", "relevant_doc_text": documents[2]["text"]}, {"query": "What are the RAM requirements?", "relevant_doc_id": "doc5", "relevant_doc_text": documents[4]["text"]} ]Step 1: Initial Retrieval with a Bi-EncoderA bi-encoder model, like those commonly used for semantic search, independently computes embeddings for the query and all documents. The relevance is then determined by the similarity (e.g., cosine similarity) between these embeddings.# Load a bi-encoder model for initial retrieval bi_encoder = SentenceTransformer('all-MiniLM-L6-v2') # Encode our document corpus doc_embeddings = bi_encoder.encode(doc_texts, convert_to_tensor=True) # Function to perform initial retrieval def retrieve_initial_documents(query_text, top_k=3): query_embedding = bi_encoder.encode(query_text, convert_to_tensor=True) # We use cosine-similarity and torch.topk to find the highest scores cos_scores = util.cos_sim(query_embedding, doc_embeddings)[0] top_results = torch.topk(cos_scores, k=top_k) retrieved_docs = [] print(f"\nQuery: {query_text}") print("Top initial results (Bi-Encoder):") for i, (score, idx) in enumerate(zip(top_results[0], top_results[1])): retrieved_docs.append({"id": documents[idx.item()]["id"], "text": documents[idx.item()]["text"], "score": score.item()}) print(f"{i+1}. ID: {documents[idx.item()]['id']}, Score: {score.item():.4f}, Text: {documents[idx.item()]['text'][:100]}...") return retrieved_docs # Let's test initial retrieval for one query sample_query = queries_with_ground_truth[0]["query"] # "How do I install the software?" initial_candidates = retrieve_initial_documents(sample_query, top_k=3)You'll notice that the initial retrieval is fast. However, the top results might not always place the most relevant document at the very top, or it might include documents that are only tangentially related. For "How do I install the software?", doc2 is ideal. Let's see if it's ranked first. Sometimes, document doc6 ("Updates are automatically downloaded and installed...") might appear highly due to shared terms like "install", even if it's not about the initial setup.Step 2: Implementing the Re-ranking Stage with a Cross-EncoderCross-encoder models work differently. Instead of comparing independent embeddings, they take a query and a document pair as input and output a single score representing their relevance. This allows the model to perform a much deeper, more fine-grained comparison, often leading to superior relevance ranking but at a higher computational cost. That's why they are typically used to re-rank a smaller set of candidates from an initial, faster retrieval stage.# Load a cross-encoder model for re-ranking # Common choices include models fine-tuned on MS MARCO or similar passage ranking datasets. # 'cross-encoder/ms-marco-MiniLM-L-6-v2' is a good, relatively small model. cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2') # Function to re-rank documents using the cross-encoder def rerank_documents(query_text, candidate_docs): # Prepare pairs for the cross-encoder: [ (query, doc_text1), (query, doc_text2), ... ] pairs = [] for doc in candidate_docs: pairs.append((query_text, doc['text'])) # Get scores from the cross-encoder # The cross_encoder.predict() method takes a list of pairs and returns a list of scores. scores = cross_encoder.predict(pairs) # Combine candidates with their new scores and sort for i in range(len(candidate_docs)): candidate_docs[i]['cross_score'] = scores[i] # Sort by the new cross-encoder score in descending order reranked_docs = sorted(candidate_docs, key=lambda x: x['cross_score'], reverse=True) print("\nRe-ranked results (Cross-Encoder):") for i, doc in enumerate(reranked_docs): print(f"{i+1}. ID: {doc['id']}, Cross-Score: {doc['cross_score']:.4f}, Text: {doc['text'][:100]}...") return reranked_docs # Re-rank the candidates from our previous example reranked_candidates = rerank_documents(sample_query, initial_candidates)Observe the output. You should see that the cross-encoder potentially re-orders the initial_candidates. Ideally, the most relevant document (e.g., doc2 for "How do I install the software?") now has the highest cross_score and is ranked first. The scores themselves are different from the bi-encoder's cosine similarity; cross-encoder scores are often logits that are not bounded between 0 and 1 but directly reflect relevance.Step 3: Evaluating the Impact of Re-rankingTo objectively measure the improvement, we need evaluation metrics. For ranking tasks, common metrics include:Mean Reciprocal Rank (MRR): The average of the reciprocal ranks of the first correct answer. If the correct answer is at rank 1, the reciprocal rank is 1/1 = 1. If it's at rank 2, it's 1/2 = 0.5. MRR is good for tasks where finding the first relevant item quickly is important.Precision@k: The proportion of relevant documents among the top-k retrieved documents. For example, Precision@1 tells us if the very first document returned was relevant.Let's implement a simple evaluation.def calculate_mrr_and_precision_at_1(ranked_results_list, ground_truth_list): reciprocal_ranks = [] precision_at_1_scores = [] for i, ranked_docs in enumerate(ranked_results_list): query_info = ground_truth_list[i] relevant_id = query_info["relevant_doc_id"] found_rank = -1 for rank, doc in enumerate(ranked_docs): if doc["id"] == relevant_id: found_rank = rank + 1 break if found_rank != -1: reciprocal_ranks.append(1.0 / found_rank) if found_rank == 1: precision_at_1_scores.append(1.0) else: precision_at_1_scores.append(0.0) else: reciprocal_ranks.append(0.0) # Relevant document not found in top_k precision_at_1_scores.append(0.0) mrr = sum(reciprocal_ranks) / len(reciprocal_ranks) if reciprocal_ranks else 0 p_at_1 = sum(precision_at_1_scores) / len(precision_at_1_scores) if precision_at_1_scores else 0 return mrr, p_at_1 # --- Evaluation --- print("\n--- Evaluating Performance ---") initial_retrieval_results_all_queries = [] reranked_results_all_queries = [] K_INITIAL = 3 # Number of documents from initial retrieval to consider for re-ranking for item in queries_with_ground_truth: query = item["query"] print(f"\nProcessing query: {query}") # Initial Retrieval initial_docs = retrieve_initial_documents(query, top_k=K_INITIAL) initial_retrieval_results_all_queries.append(initial_docs) # Re-ranking reranked_docs = rerank_documents(query, initial_docs) # Re-rank the same initial set reranked_results_all_queries.append(reranked_docs) # Calculate metrics mrr_initial, p1_initial = calculate_mrr_and_precision_at_1(initial_retrieval_results_all_queries, queries_with_ground_truth) mrr_reranked, p1_reranked = calculate_mrr_and_precision_at_1(reranked_results_all_queries, queries_with_ground_truth) print("\n--- Evaluation Summary ---") print(f"Initial Retrieval (Bi-Encoder) -> MRR: {mrr_initial:.4f}, Precision@1: {p1_initial:.4f}") print(f"After Re-ranking (Cross-Encoder) -> MRR: {mrr_reranked:.4f}, Precision@1: {p1_reranked:.4f}") {"data": [{"x": ["Initial Retrieval", "With Re-ranking"], "y": [0.625, 0.9375], "type": "bar", "name": "MRR", "marker": {"color": "#228be6"}}, {"x": ["Initial Retrieval", "With Re-ranking"], "y": [0.50, 0.75], "type": "bar", "name": "Precision@1", "marker": {"color": "#12b886"}}], "layout": {"title": "Retrieval Performance Improvement with Re-ranking", "barmode": "group", "yaxis": {"title": "Score", "range": [0,1]}, "xaxis": {"title": "Retrieval Stage"}, "legend": {"title":{"text":"Metric"}}}}Performance comparison before and after applying a re-ranking stage. Actual values depend on the dataset and models, but an upward trend is typical. (Note: The values 0.625, 0.9375, 0.50, 0.75 are illustrative based on a good run on the sample data; your exact results may vary.)You should typically see an improvement in both MRR and Precision@1 after applying the re-ranker. This demonstrates that the re-ranking step is effectively promoting more relevant documents to higher positions.DiscussionLatency Trade-off: The primary trade-off with re-ranking is increased latency. Cross-encoders are computationally more intensive than bi-encoders. You are processing K_INITIAL documents through a typically larger model for each query. This is why it's a re-ranking step on a subset of candidates, not a primary retrieval method for large corpora.Choosing K_INITIAL: The number of documents passed from the initial retriever to the re-ranker (K_INITIAL in our code, often referred to as k' or top_n_for_reranking) is an important hyperparameter.If K_INITIAL is too small, the truly relevant document might not even make it to the re-ranking stage.If K_INITIAL is too large, the latency penalty of re-ranking increases significantly. Typical values range from 20 to 100, depending on the application's latency budget and the quality of the initial retriever.Model Selection: The choice of cross-encoder model matters. Larger models might offer better accuracy but will be slower. Models fine-tuned on datasets similar to your target domain (e.g., MS MARCO for general Q&A, or domain-specific models if available) often perform best.Computational Resources: Running cross-encoders efficiently might require GPUs, especially for low-latency applications.Exploring Cross-Encoders: While we used a standard cross-encoder, more advanced architectures like ColBERT (which you encountered earlier in this chapter) try to balance the effectiveness of cross-attention with better efficiency by pre-computing parts of the interaction.This hands-on exercise demonstrates a powerful technique for significantly enhancing the precision of your RAG system's retrieval component. By carefully selecting an initial candidate set and then applying a more sophisticated re-ranking model, you can ensure that the context provided to your generator is of the highest possible relevance, directly impacting the quality and factual accuracy of the final generated output. Remember to always evaluate the impact on both relevance metrics and system latency to find the right balance for your production environment.