Learn how to evaluate your Agno Agents and Teams across three key dimensions: accuracy (using LLM-as-a-judge), performance (runtime and memory), and reliability (tool calls).

Evaluation Dimensions

Agno evaluations cover three key dimensions:
  • Accuracy: how closely the agent's output matches the expected output, scored by a model acting as LLM-as-a-judge
  • Performance: agent runtime (latency) and memory usage
  • Reliability: whether the agent makes the expected tool calls and handles errors

Quick Start

Here’s a simple example of running an accuracy evaluation:
quick_eval.py
from typing import Optional
from agno.agent import Agent
from agno.eval.accuracy import AccuracyEval, AccuracyResult
from agno.models.openai import OpenAIChat
from agno.tools.calculator import CalculatorTools

# Create an evaluation
evaluation = AccuracyEval(
    model=OpenAIChat(id="o4-mini"),  # judge model used to score the agent's output
    agent=Agent(model=OpenAIChat(id="gpt-5-mini"), tools=[CalculatorTools()]),  # agent under test
    input="What is 10*5 then to the power of 2? do it step by step",
    expected_output="2500",
    additional_guidelines="Agent output should include the steps and the final answer.",
)

# Run the evaluation
result: Optional[AccuracyResult] = evaluation.run(print_results=True)
assert result is not None and result.avg_score >= 8

Best Practices

  • Start Simple: Begin with basic accuracy tests before progressing to complex performance and reliability evaluations
  • Use Multiple Test Cases: Don't rely on a single test case; build comprehensive test suites that cover edge cases (see the sketch after this list)
  • Track Over Time: Monitor your eval metrics continuously as you iterate on your agents
  • Combine Dimensions: Evaluate across all three dimensions for a holistic view of agent quality
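
The "multiple test cases" point can be as simple as looping over a suite of (input, expected output) pairs and reusing the AccuracyEval pattern from the Quick Start. A minimal sketch follows; the test questions, the gpt-5-mini/o4-mini model choices, and the score threshold of 8 mirror the example above and are illustrative, not prescribed.
multi_case_eval.py
from typing import Optional
from agno.agent import Agent
from agno.eval.accuracy import AccuracyEval, AccuracyResult
from agno.models.openai import OpenAIChat
from agno.tools.calculator import CalculatorTools

# Illustrative test suite: (input, expected_output) pairs, including edge cases.
test_cases = [
    ("What is 10*5 then to the power of 2? do it step by step", "2500"),
    ("What is 7 factorial divided by 5 factorial?", "42"),
    ("What is (-3) squared minus 9?", "0"),
]

agent = Agent(model=OpenAIChat(id="gpt-5-mini"), tools=[CalculatorTools()])

for input_text, expected in test_cases:
    evaluation = AccuracyEval(
        model=OpenAIChat(id="o4-mini"),  # judge model
        agent=agent,
        input=input_text,
        expected_output=expected,
    )
    result: Optional[AccuracyResult] = evaluation.run(print_results=True)
    assert result is not None and result.avg_score >= 8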

Next Steps

Dive deeper into each evaluation dimension (rough sketches of the performance and reliability evals follow this list):
  1. Accuracy Evals - Learn LLM-as-a-judge techniques and multiple test case strategies
  2. Performance Evals - Measure latency, memory usage, and compare different configurations
  3. Reliability Evals - Test tool calls, error handling, and rate limiting behavior
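
For item 2, Agno's performance evals time and profile a callable that runs your agent. The sketch below assumes a PerformanceEval class under agno.eval.performance that takes that callable plus num_iterations and warmup_runs parameters; those names may differ between Agno versions, so confirm them on the Performance Evals page before relying on this.
performance_eval.py
from agno.agent import Agent
from agno.eval.performance import PerformanceEval  # module path assumed
from agno.models.openai import OpenAIChat

def run_agent():
    # The function being measured: instantiate and run the agent once.
    agent = Agent(model=OpenAIChat(id="gpt-5-mini"), instructions="Reply in one sentence.")
    return agent.run("What is the capital of France?")

# num_iterations and warmup_runs are assumed parameter names; check the
# Performance Evals page for the exact signature in your Agno version.
performance_eval = PerformanceEval(func=run_agent, num_iterations=5, warmup_runs=1)

if __name__ == "__main__":
    performance_eval.run(print_results=True)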
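
For item 3, reliability evals check that a finished run contains the tool calls you expect. The sketch below reuses the calculator agent from the Quick Start; the ReliabilityEval constructor arguments (agent_response, expected_tool_calls) and the tool call names "multiply" and "exponentiate" are assumptions, so verify them against the Reliability Evals page and the CalculatorTools function names.
reliability_eval.py
from typing import Optional
from agno.agent import Agent
from agno.eval.reliability import ReliabilityEval, ReliabilityResult  # module path assumed
from agno.models.openai import OpenAIChat
from agno.tools.calculator import CalculatorTools

# Run the agent first; the eval then inspects the completed run for expected tool calls.
agent = Agent(model=OpenAIChat(id="gpt-5-mini"), tools=[CalculatorTools()])
response = agent.run("What is 10*5 then to the power of 2? do it step by step")

evaluation = ReliabilityEval(
    agent_response=response,                           # parameter name assumed
    expected_tool_calls=["multiply", "exponentiate"],  # assumed CalculatorTools function names
)

result: Optional[ReliabilityResult] = evaluation.run(print_results=True)
assert result is not None
result.assert_passed()  # raises if an expected tool call is missing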