Evals are how you measure the quality of your Agents and Teams. Agno provides three dimensions for evaluating Agents:

Evaluation Dimensions

  • Accuracy: how correct and complete the Agent's responses are, judged by a model (LLM-as-a-judge)
  • Performance: how fast the Agent responds and how much memory it uses
  • Reliability: whether the Agent makes the expected tool calls and handles errors and rate limits

Quick Start

Here’s a simple example of running an accuracy evaluation:
quick_eval.py
from typing import Optional
from agno.agent import Agent
from agno.eval.accuracy import AccuracyEval, AccuracyResult
from agno.models.openai import OpenAIChat
from agno.tools.calculator import CalculatorTools

# Create an evaluation
evaluation = AccuracyEval(
    model=OpenAIChat(id="o4-mini"),
    agent=Agent(model=OpenAIChat(id="gpt-5-mini"), tools=[CalculatorTools()]),
    input="What is 10*5 then to the power of 2? do it step by step",
    expected_output="2500",
    additional_guidelines="Agent output should include the steps and the final answer.",
)

# Run the evaluation
result: Optional[AccuracyResult] = evaluation.run(print_results=True)
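# Fail if the evaluation produced no result or the average score fell below 8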
assert result is not None and result.avg_score >= 8

Best Practices

  • Start Simple: Begin with basic accuracy tests before moving to complex performance and reliability evaluations
  • Use Multiple Test Cases: Don't rely on a single test case; build a suite of inputs and expected outputs (see the sketch after this list)
  • Track Over Time: Monitor your eval results as you make changes to your agents
  • Combine Dimensions: Use all three evaluation dimensions for a complete picture of agent quality
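
A small test suite can reuse the same AccuracyEval setup as the Quick Start and loop over several cases. The sketch below is illustrative: the extra inputs, expected outputs, and the 8/10 threshold are examples, not requirements of the API.
multi_case_eval.py
from typing import Optional

from agno.agent import Agent
from agno.eval.accuracy import AccuracyEval, AccuracyResult
from agno.models.openai import OpenAIChat
from agno.tools.calculator import CalculatorTools

# Illustrative (input, expected_output) pairs
TEST_CASES = [
    ("What is 10*5 then to the power of 2? do it step by step", "2500"),
    ("What is 7 factorial?", "5040"),
    ("What is the square root of 144 plus 8?", "20"),
]

agent = Agent(model=OpenAIChat(id="gpt-5-mini"), tools=[CalculatorTools()])

for input_text, expected in TEST_CASES:
    evaluation = AccuracyEval(
        model=OpenAIChat(id="o4-mini"),  # judge model
        agent=agent,
        input=input_text,
        expected_output=expected,
    )
    result: Optional[AccuracyResult] = evaluation.run(print_results=True)
    # Fail on the first case that drops below the example threshold
    assert result is not None and result.avg_score >= 8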

Next Steps

Dive deeper into each evaluation dimension:
  1. Accuracy Evals - Learn LLM-as-a-judge techniques and multiple test case strategies
  2. Performance Evals - Measure latency, memory usage, and compare different configurations
  3. Reliability Evals - Test tool calls, error handling, and rate limiting behavior (see the sketch below)
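
As a preview of the reliability dimension, the sketch below checks that the Quick Start agent actually used its calculator tools. It assumes agno.eval.reliability exposes ReliabilityEval taking an agent run response and a list of expected tool call names; the parameter names and the tool names ("multiply", "exponentiate") are assumptions to verify against the Reliability Evals page.
reliability_sketch.py
from typing import Optional

from agno.agent import Agent
from agno.eval.reliability import ReliabilityEval, ReliabilityResult
from agno.models.openai import OpenAIChat
from agno.tools.calculator import CalculatorTools

agent = Agent(model=OpenAIChat(id="gpt-5-mini"), tools=[CalculatorTools()])

# Run the agent once and keep the response, which records its tool calls
response = agent.run("What is 10*5 then to the power of 2? do it step by step")

# Assumed API: ReliabilityEval compares the tool calls in the response
# against the expected tool names (names assumed here)
evaluation = ReliabilityEval(
    agent_response=response,
    expected_tool_calls=["multiply", "exponentiate"],
)

result: Optional[ReliabilityResult] = evaluation.run(print_results=True)
assert result is not None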