Building upon the necessity for rigorous assessment outlined earlier in this chapter, we now move to the practical implementation of an evaluation framework. Theoretical metrics and evaluation strategies are valuable, but translating them into a repeatable, automated process is essential for efficiently iterating on and improving your agentic systems. This section guides you through constructing a foundational evaluation harness, a software tool designed specifically to run your agents against predefined test cases and measure their performance systematically.
An evaluation harness serves as a controlled environment where agent behavior can be observed, measured, and compared. It standardizes the execution process, ensuring that variations in results are attributable to the agent's logic or configuration, not inconsistencies in the testing setup. While sophisticated harnesses can become complex systems themselves, we'll focus on establishing a solid, extensible baseline.
A basic yet effective evaluation harness typically consists of the following components: a set of structured test case definitions specifying the goal, the tools available to the agent, and the success criteria; an agent execution interface that wraps the agent under test behind standard methods such as initialize() and run(input); metric calculation functions that score the agent's output and trajectory against each test case's ground truth; and an execution engine that runs every test case, logs the results, and reports summary statistics.
Let's consider practical implementation details, using Python as an example.
Test cases can be defined in various formats, such as YAML or JSON files, or directly as Python dictionaries or data classes. A structured format is preferable for clarity and ease of loading.
# Example test case structure (using Python dict)
test_case_1 = {
"id": "TC001",
"goal": "Find the current CEO of ExampleCorp and summarize their latest public statement regarding AI.",
"available_tools": ["web_search", "company_database_lookup"],
"success_criteria": {
"ceo_identified": "Jane Doe", # Example ground truth
"statement_found": True,
"summary_relevant": True, # Requires semantic check
"constraints": ["Must use web_search tool at least once"]
},
"max_steps": 10 # Optional constraint
}
test_suite = [test_case_1, ...] # Load from file or define inline
For expert-level evaluation, test suites should cover a wide range of scenarios: simple lookups, multi-step reasoning tasks, tasks requiring complex tool interactions, scenarios designed to trigger known failure modes (e.g., ambiguity, conflicting information), and edge cases.
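As suites grow, it also helps to keep test cases in a file rather than inline in code. The helper below is a minimal sketch for loading and lightly validating such a suite; the function name and the file test_suite.json are illustrative assumptions, not part of the harness described above.
import json

def load_test_suite(path: str) -> list:
    """Load a list of test case dictionaries from a JSON file."""
    with open(path, "r") as f:
        suite = json.load(f)
    # Light validation: every case needs the fields the harness relies on
    required = {"id", "goal", "available_tools", "success_criteria"}
    for case in suite:
        missing = required - case.keys()
        if missing:
            raise ValueError(
                f"Test case {case.get('id', '<unknown>')} is missing fields: {sorted(missing)}"
            )
    return suite

# Hypothetical usage, assuming the cases are stored as a JSON array:
# test_suite = load_test_suite("test_suite.json")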
A simple base class can define the expected interface:
from abc import ABC, abstractmethod
class BaseAgentInterface(ABC):
def __init__(self, config):
self.config = config
# Initialize LLM, tools, memory based on config
@abstractmethod
def run(self, goal: str, available_tools: list) -> dict:
"""
Executes the agent's logic for the given goal.
Returns a dictionary containing the final answer, execution trajectory,
tool calls, errors, etc.
"""
pass
# Example implementation for a specific agent type
class MyReActAgent(BaseAgentInterface):
def run(self, goal: str, available_tools: list) -> dict:
# Implementation of the ReAct loop for this agent
trajectory = []
final_answer = None
tool_calls = []
errors = []
# ... agent execution logic ...
print(f"Running ReAct Agent on goal: {goal}") # Example logging
# Simulate execution
trajectory.append("Thought: I need to find the CEO.")
trajectory.append("Action: company_database_lookup(company='ExampleCorp')")
tool_calls.append({"tool": "company_database_lookup", "args": {"company": "ExampleCorp"}, "output": "CEO: Jane Doe"})
trajectory.append("Observation: Found CEO is Jane Doe.")
trajectory.append("Thought: Now search for her latest statement.")
trajectory.append("Action: web_search(query='Jane Doe ExampleCorp latest AI statement')")
tool_calls.append({"tool": "web_search", "args": {"query": "Jane Doe ExampleCorp latest AI statement"}, "output": "Snippet: ...committed to responsible AI..."})
trajectory.append("Observation: Found relevant statement snippet.")
final_answer = "CEO is Jane Doe. Latest statement highlights commitment to responsible AI."
return {
"final_answer": final_answer,
"trajectory": trajectory,
"tool_calls": tool_calls,
"errors": errors,
"steps_taken": len(trajectory) // 2 # Approximate steps
}
This abstraction allows the harness to swap in different agent implementations without changing any evaluation code.
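One simple way to exploit this is a small registry mapping configuration strings to agent classes, so an experiment can select an implementation by name. The sketch below is illustrative; MyPlanAndExecuteAgent is a hypothetical second subclass of BaseAgentInterface, not something defined in this section.
# Hypothetical registry; MyPlanAndExecuteAgent is illustrative and would be
# another subclass of BaseAgentInterface defined elsewhere.
AGENT_REGISTRY = {
    "react": MyReActAgent,
    # "plan_and_execute": MyPlanAndExecuteAgent,
}

def build_agent(agent_type: str, config: dict) -> BaseAgentInterface:
    """Instantiate the requested agent implementation with the given config."""
    try:
        agent_cls = AGENT_REGISTRY[agent_type]
    except KeyError:
        raise ValueError(f"Unknown agent type: {agent_type!r}")
    return agent_cls(config)

# Example: the harness only ever depends on the BaseAgentInterface contract
# agent = build_agent("react", {"llm": "gpt-4"})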
Metrics should operate on the results dictionary returned by the agent's run
method and the ground truth from the test case.
def calculate_metrics(agent_output: dict, test_case: dict) -> dict:
metrics = {}
criteria = test_case["success_criteria"]
# Example: Basic goal completion check
is_successful = True
if "ceo_identified" in criteria:
# Simple string check (can be more sophisticated)
if criteria["ceo_identified"] not in agent_output.get("final_answer", ""):
is_successful = False
if criteria.get("statement_found", False):
# Placeholder for a check on the final answer content
if "statement" not in agent_output.get("final_answer", "").lower():
is_successful = False # Simplified check
metrics["success"] = is_successful
# Example: Tool usage check
required_tool_used = False
if "constraints" in criteria:
for constraint in criteria["constraints"]:
if "Must use web_search" in constraint:
if any(call["tool"] == "web_search" for call in agent_output.get("tool_calls", [])):
required_tool_used = True
else:
# Constraint violated, potentially mark as failure or track separately
pass # Add logic as needed
metrics["required_tool_used"] = required_tool_used
# Example: Resource usage
metrics["steps_taken"] = agent_output.get("steps_taken", 0)
# Could also add token counts, latency if tracked
return metrics
For expert use cases, metric calculation might involve sophisticated techniques like semantic similarity checks using embeddings for answer relevance, parsing tool arguments for correctness, or analyzing the reasoning trajectory for logical fallacies.
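As one illustration, a semantic relevance check can compare embeddings of the agent's answer and a reference statement using cosine similarity. The sketch below uses a placeholder embed function (hash-seeded random vectors, so it stays self-contained and runnable) and an arbitrary 0.75 threshold; a real harness would substitute an actual embedding model and tune the threshold on labeled examples.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder embedding function (an assumption, not a real model).

    A real implementation would call an embedding model here; hash-seeded
    random vectors merely keep this sketch self-contained.
    """
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=384)

def is_semantically_relevant(answer: str, reference: str, threshold: float = 0.75) -> bool:
    """Return True when the cosine similarity of the embeddings exceeds the threshold."""
    a, b = embed(answer), embed(reference)
    cosine = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return cosine >= threshold

# Possible use inside calculate_metrics:
# metrics["summary_relevant"] = is_semantically_relevant(
#     agent_output.get("final_answer", ""),
#     "Jane Doe's latest public statement on responsible AI",
# )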
The execution engine ties these pieces together: it iterates over the test suite, runs the agent on each case, computes the metrics, and logs the results incrementally.
import json
import datetime
def run_evaluation(agent_interface: BaseAgentInterface, test_suite: list):
results = []
timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
log_filename = f"evaluation_results_{timestamp}.jsonl"
print(f"Starting evaluation run. Logging to {log_filename}")
for i, test_case in enumerate(test_suite):
print(f"Running Test Case {i+1}/{len(test_suite)}: {test_case['id']}")
try:
agent_output = agent_interface.run(
goal=test_case["goal"],
available_tools=test_case["available_tools"]
)
computed_metrics = calculate_metrics(agent_output, test_case)
result_entry = {
"test_case_id": test_case["id"],
"goal": test_case["goal"],
"agent_config": agent_interface.config, # Store agent config used
"raw_output": agent_output,
"metrics": computed_metrics,
"success": computed_metrics.get("success", False) # Promote primary metric
}
except Exception as e:
print(f"Error running test case {test_case['id']}: {e}")
result_entry = {
"test_case_id": test_case["id"],
"goal": test_case["goal"],
"agent_config": agent_interface.config,
"error": str(e),
"success": False,
"metrics": {"success": False} # Ensure metrics dict exists
}
results.append(result_entry)
# Log results incrementally (JSON Lines format)
with open(log_filename, 'a') as f:
f.write(json.dumps(result_entry) + '\n')
print("Evaluation run completed.")
# Aggregate and report summary statistics
total_tests = len(results)
successful_tests = sum(r.get('success', False) for r in results)
success_rate = (successful_tests / total_tests) * 100 if total_tests > 0 else 0
print(f"\nSummary:")
print(f"Total Test Cases: {total_tests}")
print(f"Successful: {successful_tests}")
print(f"Success Rate: {success_rate:.2f}%")
# Add more detailed reporting or visualization generation here
generate_summary_visualization(results, timestamp) # Call visualization function
return results
def generate_summary_visualization(results: list, timestamp: str):
# Example: Visualize success rate by test case (simplified)
if not results: return
ids = [r['test_case_id'] for r in results]
success_values = [1 if r.get('success', False) else 0 for r in results] # 1 for success, 0 for fail
# Create a simple bar chart showing success/failure per test case
plotly_fig = {
"data": [
{
"x": ids,
"y": success_values,
"type": "bar",
"marker": {
"color": ['#37b24d' if s == 1 else '#f03e3e' for s in success_values] # Green for success, Red for failure
},
"name": "Test Outcome"
}
],
"layout": {
"title": f"Evaluation Results ({timestamp})",
"xaxis": {"title": "Test Case ID", "type": "category"},
"yaxis": {"title": "Outcome (1=Success, 0=Fail)", "tickvals": [0, 1], "ticktext": ["Fail", "Success"]},
"template": "plotly_white" # Use a clean template
}
}
# Save or display the chart (implementation depends on environment)
# For web output, you might save this JSON or pass it to a frontend component
viz_filename = f"evaluation_summary_{timestamp}.json"
with open(viz_filename, 'w') as f:
json.dump(plotly_fig, f)
print(f"Visualization data saved to {viz_filename}")
# Example usage
# agent_config = {"llm": "gpt-4", "react_params": {...}}
# agent = MyReActAgent(config=agent_config)
# evaluation_results = run_evaluation(agent, test_suite)
Bar chart illustrating hypothetical outcomes for four distinct test cases, indicating success (green) or failure (red).
For expert-level applications, enhance this basic harness. One natural extension is to run test cases concurrently using Python's asyncio library instead of looping through them sequentially, which becomes important as suites grow or individual runs become slow; a minimal sketch follows.
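The sketch below assumes the agent's run method is safe to call from multiple worker threads and simplifies logging; treat it as a starting point rather than a drop-in replacement for run_evaluation.
import asyncio

async def run_evaluation_concurrently(agent_interface: BaseAgentInterface,
                                      test_suite: list,
                                      max_concurrency: int = 4) -> list:
    """Run test cases concurrently by offloading blocking run() calls to threads."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(test_case: dict) -> dict:
        async with semaphore:
            # asyncio.to_thread (Python 3.9+) executes the blocking call in a worker thread
            agent_output = await asyncio.to_thread(
                agent_interface.run,
                test_case["goal"],
                test_case["available_tools"],
            )
        metrics = calculate_metrics(agent_output, test_case)
        return {
            "test_case_id": test_case["id"],
            "metrics": metrics,
            "success": metrics.get("success", False),
        }

    # gather preserves the order of the test suite in the returned list
    return await asyncio.gather(*(run_one(tc) for tc in test_suite))

# Example: results = asyncio.run(run_evaluation_concurrently(agent, test_suite))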
Building even a basic evaluation harness provides immense value by transforming evaluation from an ad-hoc activity into a systematic, repeatable process. It forms the foundation for data-driven development, enabling you to reliably track progress, identify regressions, and pinpoint areas for optimization in your complex agentic systems. Start simple, and iteratively enhance your harness as your evaluation needs become more sophisticated.