Evals are regression tests for your agents: same prompts, same agents, run on a schedule, failing when behavior drifts.

Define cases

Cases live in evals/cases.py. Each case is a prompt, the agent that should answer it, and a criterion the answer must satisfy. Case is a dataclass defined at the top of the same file, so opening evals/cases.py shows you everything in one place.
evals/cases.py
CASES: tuple[Case, ...] = (
    Case(
        name="web_search_recent_anthropic_research",
        agent=web_search,
        input="What did Anthropic publish about agent research recently?",
        criteria=(
            "Answers the question by citing at least one real Anthropic URL "
            "(anthropic.com domain). The response is grounded in fetched content "
            "rather than refusing to answer."
        ),
        expected_tool_calls=(_WEB_SEARCH_TOOL,),
    ),
    # add more cases here
)
criteria is graded by an LLM judge. expected_tool_calls checks that the agent actually used the tools you expect. _WEB_SEARCH_TOOL switches between parallel_search (SDK) and web_search (MCP) based on whether PARALLEL_API_KEY is set.
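
For reference, the dataclass and the tool switch might look roughly like this. This is a sketch, not the actual source: the field names mirror the case above, and everything else is an assumption.

import os
from dataclasses import dataclass

# Sketch only -- the authoritative definitions live at the top of evals/cases.py.
_WEB_SEARCH_TOOL = "parallel_search" if os.environ.get("PARALLEL_API_KEY") else "web_search"

@dataclass(frozen=True)
class Case:
    name: str                                   # unique id, shown in eval history
    agent: object                               # the agent under test
    input: str                                  # prompt sent to the agent
    criteria: str                               # rubric handed to the LLM judge
    expected_tool_calls: tuple[str, ...] = ()   # tool names the agent must call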

Run the suite

python -m evals
Results are written to Postgres via eval_db. Eval history shows up at os.agno.com alongside your sessions and traces, so you can see when a case started failing and what changed.
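
Conceptually, the runner is a loop over CASES. Here is a minimal sketch, assuming each agent exposes a run(prompt) method whose response carries content and a tools list; the LLM-judge grading and the eval_db writes are elided:

# Illustrative only; the real evals/__main__.py also grades criteria
# with the LLM judge and records results to Postgres via eval_db.
import sys
from evals.cases import CASES

failures = []
for case in CASES:
    response = case.agent.run(case.input)                    # assumed agent API
    called = {t.tool_name for t in (response.tools or [])}   # assumed response shape
    missing = set(case.expected_tool_calls) - called
    if missing:
        failures.append((case.name, f"missing tool calls: {sorted(missing)}"))
    # ...send response.content and case.criteria to the LLM judge here...

for name, reason in failures:
    print(f"FAIL {name}: {reason}")
sys.exit(1 if failures else 0)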

Diagnose failures with Claude Code

Open Claude Code and paste:
Run docs/eval-and-improve.md
Claude runs the full suite, triages every failure (bad criteria, real regression, flaky LLM judge), and proposes in-scope fixes. It edits the agent or the case, re-runs, and shows you the diff.

When to run evals

Trigger                                  Frequency
Before deploying a change to an agent    Every time
As part of CI                            Every PR
Against production                       On a weekly cron
After bumping a model version            Every time
Wire the weekly run into the platform’s own scheduler. See scheduling for the cron API, and the next page for production setup.

What good cases look like

  • Specific. “Returns a JSON object with ticker and price” beats “Returns the right answer”.
  • Stable. Avoid prompts whose correct answer changes daily. Use phrasing like “describes a real, recent…” instead of locking in a specific result.
  • Scoped to one behavior. One case per behavior makes failures easy to read.
  • Anchored to tools. expected_tool_calls catches the failure mode where the agent confidently makes things up instead of calling a tool. A case that combines all four traits is sketched after this list.
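
Putting the four traits together, a well-shaped case might look like this. quote_agent and _QUOTE_TOOL are hypothetical names used only for illustration:

# Hypothetical case; the agent and tool names do not exist in this repo.
Case(
    name="quote_returns_json_with_ticker_and_price",
    agent=quote_agent,
    input="What is the current price of AAPL? Respond as JSON.",
    criteria=(
        "Returns a JSON object with 'ticker' and 'price' fields, where "
        "'ticker' is 'AAPL' and 'price' is a plausible number. Any current "
        "price is acceptable, since the correct value changes daily."
    ),
    expected_tool_calls=(_QUOTE_TOOL,),
)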

Next

Deploy to Railway →