Evals are regression tests for your agents: same prompts, same agents, run on a schedule, failing when behavior drifts.

Define cases

Cases live in evals/cases.py. Each case is a prompt, the agent that should answer it, and a criterion the answer must satisfy. Case is a dataclass defined at the top of the same file, so opening evals/cases.py shows you everything in one place.
evals/cases.py
CASES: tuple[Case, ...] = (
    Case(
        name="web_search_recent_anthropic_research",
        agent=web_search,
        input="What did Anthropic publish about agent research recently?",
        criteria=(
            "Answers the question by citing at least one real Anthropic URL "
            "(anthropic.com domain). The response is grounded in fetched content "
            "rather than refusing to answer."
        ),
        expected_tool_calls=(_WEB_SEARCH_TOOL,),
    ),
    # add more cases here
)
criteria is graded by an LLM judge. expected_tool_calls checks that the agent actually used the tools you expect. _WEB_SEARCH_TOOL switches between parallel_search (SDK) and web_search (MCP) based on whether PARALLEL_API_KEY is set.
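
For reference, the dataclass and the tool switch might look roughly like this. This is a sketch, not the actual source: the field names mirror the case above, and everything else is an assumption.

import os
from dataclasses import dataclass

# Sketch only -- the authoritative definitions live at the top of evals/cases.py.
_WEB_SEARCH_TOOL = "parallel_search" if os.environ.get("PARALLEL_API_KEY") else "web_search"

@dataclass(frozen=True)
class Case:
    name: str                                   # unique id, shown in eval history
    agent: object                               # the agent under test
    input: str                                  # prompt sent to the agent
    criteria: str                               # rubric handed to the LLM judge
    expected_tool_calls: tuple[str, ...] = ()   # tool names the agent must call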

Run the suite

python -m evals
Results are written to Postgres via eval_db. Eval history shows up at os.agno.com alongside your sessions and traces, so you can see when a case started failing and what changed.
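
Conceptually, the runner is a loop over CASES. Here is a minimal sketch, assuming each agent exposes a run(prompt) method whose response carries content and a tools list; the LLM-judge grading and the eval_db writes are elided:

# Illustrative only; the real evals/__main__.py also grades criteria
# with the LLM judge and records results to Postgres via eval_db.
import sys
from evals.cases import CASES

failures = []
for case in CASES:
    response = case.agent.run(case.input)                    # assumed agent API
    called = {t.tool_name for t in (response.tools or [])}   # assumed response shape
    missing = set(case.expected_tool_calls) - called
    if missing:
        failures.append((case.name, f"missing tool calls: {sorted(missing)}"))
    # ...send response.content and case.criteria to the LLM judge here...

for name, reason in failures:
    print(f"FAIL {name}: {reason}")
sys.exit(1 if failures else 0)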

Diagnose failures with Claude Code

Open Claude Code and paste:
Run docs/eval-and-improve.md
Claude runs the full suite, triages every failure (bad criteria, real regression, flaky LLM judge), and proposes in-scope fixes. It edits the agent or the case, re-runs, and shows you the diff.

When to run evals

Trigger                                  Frequency
Before deploying a change to an agent    Every time
As part of CI                            Every PR
Against production                       On a weekly cron
After bumping a model version            Every time
Wire the weekly run into the platform’s own scheduler. See scheduling for the cron API, and the next page for production setup.

What good cases look like

  • Specific. “Returns a JSON object with ticker and price” beats “Returns the right answer”.
  • Stable. Avoid prompts whose correct answer changes daily. Use phrasing like “describes a real, recent…” instead of locking in a specific result.
  • Scoped to one behavior. One case per behavior makes failures easy to read.
  • Anchored to tools. expected_tool_calls catches the failure mode where the agent confidently makes things up instead of calling a tool. A case that combines all four traits is sketched after this list.
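
Putting the four traits together, a well-shaped case might look like this. quote_agent and _QUOTE_TOOL are hypothetical names used only for illustration:

# Hypothetical case; the agent and tool names do not exist in this repo.
Case(
    name="quote_returns_json_with_ticker_and_price",
    agent=quote_agent,
    input="What is the current price of AAPL? Respond as JSON.",
    criteria=(
        "Returns a JSON object with 'ticker' and 'price' fields, where "
        "'ticker' is 'AAPL' and 'price' is a plausible number. Any current "
        "price is acceptable, since the correct value changes daily."
    ),
    expected_tool_calls=(_QUOTE_TOOL,),
)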

Next

Deploy to Railway →