Tuning a sample LangChain chain provides a practical application for optimization. This process involves identifying performance issues, applying specific techniques, and measuring their impact.

## Scenario: A Multi-Step Document Analysis Chain

Imagine a chain designed to answer questions based on a collection of technical reports. The process involves:

1. **Retrieval:** Finding relevant document chunks using a vector store.
2. **Initial Answer Generation:** Using an LLM to generate an answer based only on the retrieved chunks.
3. **Answer Refinement:** Using a second, potentially more powerful LLM call to refine the initial answer, improve coherence, and add context based on the original question and the initial answer.

This multi-step process is common but can introduce latency and increase costs due to multiple LLM interactions and data retrieval operations.

Let's assume our initial chain implementation looks something like this:

```python
# Assume retriever, llm_initial, llm_refine are pre-configured
# retriever: A vector store retriever
# llm_initial: A moderately sized LLM for quick answer generation
# llm_refine: A larger, more capable LLM for refinement

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

# Simplified RAG setup
retrieve_docs = RunnablePassthrough.assign(
    context=(lambda x: x["question"]) | retriever
)

# Initial answer prompt and chain
initial_prompt_template = ChatPromptTemplate.from_template(
    "Based on this context:\n{context}\n\nAnswer the question: {question}"
)
initial_answer_chain = initial_prompt_template | llm_initial | StrOutputParser()

# Refinement prompt and chain
refine_prompt_template = ChatPromptTemplate.from_template(
    "Refine this initial answer: '{initial_answer}' based on the original question: "
    "'{question}'. Ensure coherence and accuracy."
)
refine_chain = refine_prompt_template | llm_refine | StrOutputParser()

# Full chain combining steps
full_chain = retrieve_docs | RunnablePassthrough.assign(
    initial_answer=initial_answer_chain
) | RunnablePassthrough.assign(
    final_answer=(lambda x: {"initial_answer": x["initial_answer"], "question": x["question"]}) | refine_chain
)

# Example Invocation
# result = full_chain.invoke({"question": "What are the scaling limits of System X?"})
# print(result["final_answer"])
```
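The chain definition assumes `retriever`, `llm_initial`, and `llm_refine` already exist. If you want to run the example end to end, a minimal sketch of that setup might look like the following; the packages, model names, and sample documents here are illustrative assumptions, not part of the original chain.

```python
# Hypothetical component setup (requires langchain-openai, langchain-community,
# faiss-cpu, and an OPENAI_API_KEY in the environment).
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

# Stand-in corpus of technical report snippets
report_chunks = [
    "System X sustains roughly 10k requests per second before latency degrades.",
    "Performance degradation under load correlates with connection pool exhaustion.",
]

# Vector store retriever over the report chunks
vector_store = FAISS.from_texts(report_chunks, OpenAIEmbeddings())
retriever = vector_store.as_retriever(search_kwargs={"k": 4})

# A smaller model for the quick first pass and a larger one for refinement
# (model names are placeholders; substitute whatever you use in practice)
llm_initial = ChatOpenAI(model="gpt-4o-mini", temperature=0)
llm_refine = ChatOpenAI(model="gpt-4o", temperature=0)
```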
## Step 1: Baseline Performance Measurement

Before optimizing, we need a baseline. We can use simple timing or integrate with a tracing tool like LangSmith (covered in Chapter 5). For simplicity, let's use basic timing: we run the chain with a sample question multiple times and average the results.

```python
import random
import statistics
import time

question = "Summarize the main findings regarding performance degradation under load."
num_runs = 5
latencies = []

# Assume token tracking is implemented separately or via LangSmith
for _ in range(num_runs):
    start_time = time.time()
    # result = full_chain.invoke({"question": question})  # Execute the chain
    # Simulate execution time for demonstration
    time.sleep(12 + (random.random() * 6 - 3))  # Simulate 9-15 sec latency
    end_time = time.time()
    latencies.append(end_time - start_time)

average_latency = statistics.mean(latencies)
print(f"Average latency (baseline): {average_latency:.2f} seconds")

# Let's assume baseline token count observed via logs/LangSmith: ~2100 tokens per query
```

Let's say our baseline measurement yields:

- **Average Latency:** 13.5 seconds
- **Average Token Usage:** 2100 tokens (estimated)

Using LangSmith or detailed logging, we might find the breakdown:

- Retrieval: 1.5 seconds
- Initial answer LLM (`llm_initial`): 4.0 seconds (800 tokens)
- Refinement LLM (`llm_refine`): 8.0 seconds (1300 tokens)

The refinement step (`llm_refine`) is the most significant bottleneck in both latency and token consumption.

## Step 2: Applying Optimization Techniques

Let's apply some of the techniques discussed earlier.

### Technique 1: Caching LLM Responses

Identical questions or intermediate processing steps might occur frequently. Caching LLM responses can dramatically reduce latency and cost for repeat requests. Let's add an in-memory cache. For production, you'd typically use a more persistent cache like Redis, SQL, or specialized vector caching.

```python
from langchain.cache import InMemoryCache
from langchain.globals import set_llm_cache

# Set up a simple in-memory cache
set_llm_cache(InMemoryCache())

# No changes needed to the chain definition itself if LLMs are configured globally
# Or, apply the cache directly when initializing LLMs:
# llm_initial = ChatOpenAI(..., cache=InMemoryCache())
# llm_refine = ChatOpenAI(..., cache=InMemoryCache())

# Re-run the timing test, making sure to run the *same* question multiple times.
# The first run will be slow; subsequent identical runs should be much faster.
```

After adding caching and running the same query again:

- **First Run Latency:** ~13.5 seconds (cache miss)
- **Subsequent Runs Latency:** ~1.6 seconds (cache hit for both LLMs, dominated by retrieval)
- **Token Usage (Subsequent Runs):** 0 tokens (served from cache)

Caching is highly effective for repeated inputs but doesn't help with novel queries.
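As a small step toward the persistent caches mentioned above, `langchain_community` also provides a SQLite-backed cache that survives process restarts. A minimal sketch, with an arbitrary local database path:

```python
from langchain_community.cache import SQLiteCache
from langchain.globals import set_llm_cache

# Persist cached LLM responses to a local SQLite file so cache hits
# survive application restarts (the file path is an arbitrary example).
set_llm_cache(SQLiteCache(database_path=".langchain_cache.db"))
```

Redis-backed and other cache implementations follow the same `set_llm_cache` pattern, so swapping the backend later requires no changes to the chain itself.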
### Technique 2: Optimizing the Refinement Step

The refinement LLM call is our primary bottleneck for novel queries. Several options are available:

- **Prompt Engineering:** Can we make the refinement prompt more concise? Perhaps the initial prompt can request a more structured output that requires less refinement. Let's assume we tighten `refine_prompt_template` so it is slightly shorter, saving maybe 50 tokens per call on average.
- **Model Selection:** Is the powerful `llm_refine` strictly necessary? Could a slightly smaller, faster model achieve acceptable quality? Let's hypothetically switch `llm_refine` to a model known to be ~30% faster and to use ~30% fewer tokens on average for similar tasks, perhaps accepting a minor quality trade-off.
- **Conditional Execution:** Maybe refinement isn't always needed. We could add a step before refinement that uses a simpler model or a rule-based check to determine whether the initial answer is good enough. If it is, skip the refinement call entirely (a sketch of this pattern follows the optimized chain below).

Let's simulate the effect of switching `llm_refine` to a faster model and shortening the prompt slightly.

```python
# Assume llm_refine_faster is configured (a faster, slightly less powerful model)
# Assume refine_prompt_template_optimized is slightly shorter

# Update the refine_chain part
refine_chain_optimized = refine_prompt_template_optimized | llm_refine_faster | StrOutputParser()

# Update the full chain definition to use the optimized refine chain
full_chain_optimized = retrieve_docs | RunnablePassthrough.assign(
    initial_answer=initial_answer_chain
) | RunnablePassthrough.assign(
    final_answer=(lambda x: {"initial_answer": x["initial_answer"], "question": x["question"]}) | refine_chain_optimized
)

# Re-run the timing test for novel queries (cache won't help here initially)
# ... timing code ...
```
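The conditional-execution idea can be expressed with `RunnableBranch`, building on the definitions above. The sketch below uses a deliberately crude length-based heuristic as the "good enough" check; the threshold and helper function are hypothetical, and a production check might instead use a small grader model.

```python
from langchain_core.runnables import RunnableBranch, RunnablePassthrough

def needs_refinement(inputs: dict) -> bool:
    # Crude stand-in heuristic: treat short initial answers as needing refinement.
    # Replace with a cheap grader LLM or rule set in a real application.
    return len(inputs["initial_answer"]) < 200

# Only invoke the expensive refinement model when the check says so;
# otherwise pass the initial answer through as the final answer.
maybe_refine = RunnableBranch(
    (needs_refinement, refine_chain_optimized),
    lambda x: x["initial_answer"],
)

full_chain_conditional = retrieve_docs | RunnablePassthrough.assign(
    initial_answer=initial_answer_chain
) | RunnablePassthrough.assign(
    final_answer=(lambda x: {"initial_answer": x["initial_answer"], "question": x["question"]}) | maybe_refine
)
```

For questions whose initial answer passes the check, this skips the refinement call entirely, trading a little answer polish for lower latency and cost.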
## Step 3: Re-evaluating Performance

Let's measure the performance of `full_chain_optimized` for novel queries (cache miss scenario):

```python
# Simulate execution time for demonstration after optimization
#   Retrieval: 1.5s (no change)
#   Initial LLM: 4.0s (no change, 800 tokens)
#   Refine LLM (Faster Model + Prompt): 8.0s * 0.7 ≈ 5.6s (1300 tokens * 0.7 - 50 ≈ 860 tokens)
#   Total Latency ≈ 1.5 + 4.0 + 5.6 = 11.1s
#   Total Tokens ≈ 800 + 860 = 1660 tokens

# --- Python code to simulate and measure ---
latencies_optimized = []
for _ in range(num_runs):
    start_time = time.time()
    # result = full_chain_optimized.invoke({"question": question})  # Execute the optimized chain
    # Simulate optimized execution time
    time.sleep(10 + (random.random() * 3 - 1.5))  # Simulate 8.5-11.5 sec latency
    end_time = time.time()
    latencies_optimized.append(end_time - start_time)

average_latency_optimized = statistics.mean(latencies_optimized)
print(f"Average latency (optimized): {average_latency_optimized:.2f} seconds")

# Estimated optimized token count: ~1660 tokens
```

Our new measurements for novel queries might look like this:

- **Average Latency:** 10.5 seconds (originally 13.5 s)
- **Average Token Usage:** 1660 tokens (originally 2100 tokens)

## Results Comparison

Comparing the baseline against the optimized chain:

| Metric | Baseline | Optimized (Novel Query) | Optimized (Cached Query) |
| --- | --- | --- | --- |
| Avg Latency (s) | 13.5 | 10.5 | 1.6 |
| Avg Tokens | 2100 | 1660 | 0 |

Comparison of average latency and token usage before and after applying caching and model optimization techniques. Note the dramatic improvement for cached queries.

## Cost Implications

Reducing token count directly impacts cost. If the combined cost of `llm_initial` and `llm_refine` was $0.002 per 1K tokens:

- **Baseline Cost:** (2100 / 1000) * $0.002 = $0.0042 per query
- **Optimized Cost (Novel):** (1660 / 1000) * $0.0018 (assuming the faster model is cheaper) ≈ $0.0030 per query (approx. 28% saving)
- **Optimized Cost (Cached):** $0.00 per query (100% saving)

## Trade-offs and Further Steps

We achieved significant improvements:

- **Caching:** Excellent for repeat queries with minimal code change, at the cost of some memory or storage for the cache itself.
- **Model Swapping / Prompt Tuning:** Reduced latency and cost for all queries, but potentially involved a minor trade-off in the quality of the refined answer. Evaluating this quality difference is important (covered in Chapter 5).

This practice exercise demonstrates a typical tuning workflow:

1. **Measure:** Establish a baseline.
2. **Identify:** Pinpoint the most expensive steps (time, tokens, cost).
3. **Optimize:** Apply targeted techniques such as caching, model selection, prompt engineering, or structural changes (e.g., conditional execution).
4. **Re-evaluate:** Measure the impact of your changes.
5. **Iterate:** Performance tuning is often iterative. Further improvements might involve optimizing the retrieval step, exploring parallel execution for independent tasks, or implementing more sophisticated caching.

Remember to leverage tools like LangSmith for detailed tracing and analysis, which simplifies the identification and measurement phases considerably in complex applications.
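If LangSmith is available to you, tracing can typically be switched on with environment variables alone, without modifying the chain. A minimal sketch; the API key placeholder and project name are examples, not values from this chapter.

```python
import os

# Enable LangSmith tracing for all chain invocations in this process.
# Use your own LangSmith API key; the project name is an arbitrary example.
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "<your-langsmith-api-key>"
os.environ["LANGCHAIN_PROJECT"] = "doc-analysis-tuning"

# Subsequent calls such as full_chain_optimized.invoke({"question": question})
# now record per-step traces (latency, token usage), which makes the
# "Identify" phase of the workflow much easier.
```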