Theory provides the foundation, but practical application solidifies understanding. In this section you will evaluate a LangChain agent using LangSmith: set up a simple agent, create an evaluation dataset, run the agent against that dataset, and implement custom evaluation logic to assess its performance programmatically.

This exercise assumes you have LangSmith set up and your API key configured in your environment (LANGCHAIN_API_KEY). You should also have basic familiarity with creating LangChain agents and using tools.

1. Define the Agent Under Test

First, let's define a straightforward agent that uses a search tool. We'll use Tavily as our search tool for this example, but you could substitute another search tool or custom tools. Ensure you have the necessary packages installed (langchain, langchain_openai, tavily-python, langsmith). Also, set your Tavily API key (TAVILY_API_KEY) and OpenAI API key (OPENAI_API_KEY) as environment variables.

```python
import os

from langchain_openai import ChatOpenAI
from langchain_community.tools.tavily_search import TavilySearchResults
from langchain import hub
from langchain.agents import create_openai_functions_agent, AgentExecutor
from langsmith import Client

# Ensure API keys are set as environment variables
# os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"
# os.environ["TAVILY_API_KEY"] = "YOUR_TAVILY_API_KEY"
# os.environ["LANGCHAIN_API_KEY"] = "YOUR_LANGSMITH_API_KEY"
# os.environ["LANGCHAIN_TRACING_V2"] = "true"  # Ensure tracing is enabled
# os.environ["LANGCHAIN_PROJECT"] = "Agent Evaluation Example"  # Optional: define a LangSmith project

# Initialize the LLM and tool
llm = ChatOpenAI(model="gpt-3.5-turbo-1106", temperature=0)
search_tool = TavilySearchResults(max_results=2)
tools = [search_tool]

# Get the prompt template
# Using a standard OpenAI Functions Agent prompt
prompt = hub.pull("hwchase17/openai-functions-agent")

# Create the agent
# This agent is designed to work with models that support function calling
agent = create_openai_functions_agent(llm, tools, prompt)

# Create the AgentExecutor
agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)

# Test invocation (optional)
# print(agent_executor.invoke({"input": "What was the score of the last SF Giants game?"}))
```

We now have agent_executor, which represents the system we want to evaluate.

2. Create an Evaluation Dataset in LangSmith

Evaluation requires a set of inputs and, ideally, expected outputs or criteria against which to judge the agent's performance. Let's create a small dataset directly in LangSmith using the client library. We'll include inputs (questions for our agent) and optional reference outputs.
```python
# Initialize the LangSmith client
client = Client()

dataset_name = "Simple Search Agent Questions V1"
dataset_description = "Basic questions requiring web search."

# Check whether the dataset already exists; create it if not
try:
    dataset = client.read_dataset(dataset_name=dataset_name)
    print(f"Dataset '{dataset_name}' already exists.")
except Exception:  # The client raises an error if the dataset is not found
    dataset = client.create_dataset(
        dataset_name=dataset_name,
        description=dataset_description,
    )
    print(f"Created dataset '{dataset_name}'.")

# Define examples (input questions and optional reference outputs)
examples = [
    ("What is the capital of France?", "Paris"),
    ("Who won the 2023 Formula 1 Championship?", "Max Verstappen"),
    ("What is the main component of air?", "Nitrogen"),
    (
        "Summarize the plot of the movie 'Inception'.",
        "A thief who steals information by entering people's dreams takes on the "
        "inverse task of planting an idea into a target's subconscious.",
    ),  # Example reference output
]

# Add the examples to the dataset
for input_query, reference_output in examples:
    client.create_example(
        inputs={"input": input_query},
        outputs={"reference": reference_output},  # Using the 'reference' key for the expected output
        dataset_id=dataset.id,
    )

print(f"Added {len(examples)} examples to dataset '{dataset_name}'.")
```

After running this code, you should see a new dataset named "Simple Search Agent Questions V1" in your LangSmith account, populated with the defined examples. The outputs dictionary in create_example can store reference values, labels, or any other information useful for evaluation.

3. Run Evaluation Using LangSmith

With the agent defined and the dataset created, we can now run the agent over each example in the dataset using LangSmith's evaluation utilities. We'll start without a custom evaluator, primarily to collect traces and observe behavior.

```python
from langsmith.evaluation import evaluate

# Define a function that encapsulates the agent invocation.
# The evaluate function calls it once per dataset example.
def agent_predictor(inputs: dict) -> dict:
    """Runs the agent executor for a given input dictionary."""
    return agent_executor.invoke({"input": inputs["input"]})  # Assumes the dataset input key is "input"

# Run the evaluation.
# This executes agent_predictor for each example in the dataset;
# results and traces are automatically logged to LangSmith.
evaluation_results = evaluate(
    agent_predictor,
    data=dataset_name,  # Can pass the dataset name directly
    description="Initial evaluation run for the search agent.",
    experiment_prefix="Agent Eval Run - Simple Search",  # Optional: names the experiment in LangSmith
    # metadata={"agent_version": "1.0"},  # Optional: add metadata to the run
)

print("Evaluation run completed. Check LangSmith for results.")
```

Navigate to the dataset in LangSmith. You should find a new evaluation run associated with it. Click on it to explore:

- Traces: Detailed logs of each agent execution for every example in the dataset. You can inspect the LLM calls, tool usage, inputs, and outputs.
- Results Table: A summary view showing inputs, actual outputs, reference outputs (if provided), latency, and token counts.
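The same information that appears in the results table can also be read programmatically from the object returned by evaluate(). The sketch below is a minimal illustration: it assumes each item yielded while iterating the results exposes the agent run under "run" and the dataset example under "example"; the exact shape may differ between langsmith versions.

```python
# Minimal sketch: print a small comparison of inputs, outputs, and references.
# Assumes iterating the return value of evaluate() yields per-example items
# keyed by "run" and "example"; adjust if your langsmith version differs.
for item in evaluation_results:
    run = item["run"]          # The traced agent execution
    example = item["example"]  # The dataset example it was run against
    print("Question: ", example.inputs.get("input"))
    print("Agent out:", (run.outputs or {}).get("output"))
    print("Reference:", (example.outputs or {}).get("reference"))
    print("-" * 60)
```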
4. Implement a Custom Evaluator

Simply running the agent and tracing is useful for debugging, but quantitative evaluation requires defining specific metrics. Let's create a custom evaluator function that checks whether the agent's output contains the reference answer (case-insensitive).

```python
from langsmith.evaluation import EvaluationResult, run_evaluator

@run_evaluator
def check_contains_reference(run, example) -> EvaluationResult:
    """
    Checks if the agent's output contains the reference answer (case-insensitive).

    Args:
        run: The LangSmith run object for the agent execution.
        example: The LangSmith example object from the dataset.

    Returns:
        An EvaluationResult with a score (1 if the reference is contained, 0 otherwise)
        and a descriptive comment.
    """
    agent_output = run.outputs.get("output") if run.outputs else None
    reference_output = example.outputs.get("reference") if example.outputs else None

    if agent_output is None or reference_output is None:
        # Handle cases where the output or reference is missing
        score = 0
        comment = "Agent output or reference output missing."
    elif str(reference_output).lower() in str(agent_output).lower():
        score = 1  # Success: reference found in the agent output
        comment = "Reference answer found."
    else:
        score = 0  # Failure: reference not found
        comment = f"Reference '{reference_output}' not found in output."

    return EvaluationResult(
        key="contains_reference",  # Name of the metric
        score=score,               # The numeric score (0 or 1 here)
        comment=comment,           # Optional qualitative feedback
    )
```

This function uses the @run_evaluator decorator, indicating it's designed for LangSmith evaluation. It accesses the agent's actual output (run.outputs) and the reference output stored in the dataset (example.outputs). It returns an EvaluationResult object containing a metric name and a score.

5. Run Evaluation with the Custom Evaluator

Now, let's re-run the evaluation, this time including our custom evaluator.

```python
# Run the evaluation again, now with the custom evaluator
custom_eval_results = evaluate(
    agent_predictor,
    data=dataset_name,
    evaluators=[check_contains_reference],  # Pass the custom evaluator function
    description="Evaluation run with custom 'contains_reference' check.",
    experiment_prefix="Agent Eval Run - Custom Check",  # Log under a separate experiment name
    # metadata={"agent_version": "1.0", "evaluator": "contains_reference_v1"},
)

print("Evaluation run with custom evaluator completed. Check LangSmith.")
```

Go back to LangSmith and view this new evaluation run. In the results table, you should now see a new column titled contains_reference (matching the key in our EvaluationResult). This column displays the score (0 or 1) for each example based on our custom logic. You can sort and filter by this metric to quickly identify failures. Hovering over or clicking into the feedback cell often shows the comment provided by the evaluator.

If we were to visualize the results of this simple evaluation (hypothetically, based on the contains_reference scores), we might see something like three examples passing (score = 1) and one failing (score = 0).

[Bar chart: "Evaluation Results: 'contains_reference' Metric" — count of examples passing (score = 1) versus failing (score = 0).]
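If you want to reproduce counts like these without opening the UI, you can tally the contains_reference scores directly from the returned results. This is a minimal sketch; the nested "evaluation_results"/"results" access is an assumption about how per-example feedback is exposed and may vary across langsmith versions.

```python
from collections import Counter

# Minimal sketch: aggregate the custom metric into pass/fail counts.
# Assumes each iterated item exposes its evaluator feedback under
# item["evaluation_results"]["results"]; adjust for your langsmith version.
score_counts = Counter()
for item in custom_eval_results:
    for feedback in item["evaluation_results"]["results"]:
        if feedback.key == "contains_reference":
            score_counts["pass" if feedback.score == 1 else "fail"] += 1

print(f"Pass: {score_counts['pass']}  Fail: {score_counts['fail']}")
```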
Further Steps

This practical exercise demonstrated the core loop of evaluating an agent with LangSmith: defining the agent, creating a dataset, running the evaluation, and implementing custom checks. From here, you can build more complex evaluations:

- More Sophisticated Evaluators: Implement evaluators using regular expressions, semantic similarity checks (e.g., comparing embeddings of actual and reference outputs), or checks for specific structural properties of the output.
- LLM-as-Judge: Use another LLM to evaluate the agent's output based on criteria like helpfulness, correctness (against the reference), or lack of harmful content. LangChain provides helpers for creating these evaluators; see the sketch at the end of this section.
- Multiple Evaluators: Pass a list of different evaluator functions to evaluate to calculate multiple metrics simultaneously.
- Dataset Management: Develop strategies for curating and versioning your evaluation datasets as your application evolves.
- Integration with CI/CD: Automate these evaluation runs as part of your continuous integration pipeline to catch regressions before deployment.

Systematic evaluation using tools like LangSmith is indispensable for building reliable LLM applications. It goes beyond anecdotal testing, providing quantifiable metrics and detailed tracing to understand and improve agent performance over time.
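As a starting point for the LLM-as-Judge idea above, here is a minimal hand-rolled sketch rather than one of LangChain's prebuilt graders: it asks a separate model to compare the agent's answer with the reference. The grader model name, the prompt wording, and the CORRECT/INCORRECT parsing are illustrative assumptions.

```python
from langchain_openai import ChatOpenAI
from langsmith.evaluation import EvaluationResult, run_evaluator

# A separate, deterministic model instance used only for grading
# (the model name here is an assumption; use whichever grader you prefer).
judge_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

@run_evaluator
def llm_judge_correctness(run, example) -> EvaluationResult:
    """Asks a grader LLM whether the agent's answer agrees with the reference."""
    question = (example.inputs or {}).get("input", "")
    reference = (example.outputs or {}).get("reference", "")
    answer = (run.outputs or {}).get("output", "")

    grading_prompt = (
        "You are grading an answer to a question.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Submitted answer: {answer}\n"
        "Reply with exactly one word: CORRECT or INCORRECT."
    )
    verdict = judge_llm.invoke(grading_prompt).content.strip().upper()
    score = 1 if verdict.startswith("CORRECT") else 0
    return EvaluationResult(key="llm_judge_correctness", score=score, comment=verdict)

# This judge can be combined with the earlier check in a single run:
# evaluate(agent_predictor, data=dataset_name,
#          evaluators=[check_contains_reference, llm_judge_correctness])
```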