Evals are the regression test for your agents. Same prompts, same agents, run on a schedule, fail when behavior drifts.
Define cases
Cases live in `evals/cases.py`. Each case is a prompt, the agent that should answer it, and a criterion the answer must satisfy. `Case` is a dataclass defined at the top of the same file, so opening `evals/cases.py` shows you everything in one place.
evals/cases.py
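The listing itself isn't reproduced here, so the sketch below reconstructs it from the surrounding description. The `Case` fields mirror what the text names (`criteria`, `expected_tool_calls`, `_WEB_SEARCH_TOOL`); the example agent name and prompt are illustrative assumptions.

```python
# evals/cases.py — a minimal sketch reconstructed from this section's
# description; the example agent name and prompt are assumptions.
import os
from dataclasses import dataclass, field

# Switches between the SDK tool and the MCP tool depending on whether
# PARALLEL_API_KEY is set, as described below.
_WEB_SEARCH_TOOL = "parallel_search" if os.getenv("PARALLEL_API_KEY") else "web_search"

@dataclass
class Case:
    name: str
    agent: str                   # which agent should answer the prompt
    prompt: str                  # the input sent to that agent
    criteria: str                # graded by an LLM judge
    expected_tool_calls: list[str] = field(default_factory=list)

CASES: list[Case] = [
    Case(
        name="recent_news_uses_search",
        agent="research_agent",  # hypothetical agent name
        prompt="Find one piece of real, recent news about electric vehicles.",
        criteria="Describes a real, recent news item rather than inventing one.",
        expected_tool_calls=[_WEB_SEARCH_TOOL],
    ),
]
```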
`criteria` is graded by an LLM judge. `expected_tool_calls` checks that the agent actually used the tools you expect. `_WEB_SEARCH_TOOL` switches between `parallel_search` (SDK) and `web_search` (MCP) based on whether `PARALLEL_API_KEY` is set.
Run the suite
Every run is recorded in `eval_db`. Eval history shows up at os.agno.com alongside your sessions and traces, so you can see when a case started failing and what changed.
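The runner itself isn't shown above; here is a sketch under explicit assumptions: Agno's `AccuracyEval` and `SqliteDb` (a plausible backing for `eval_db`), a hypothetical `get_agent` lookup, and the `CASES` list sketched earlier. Asserting on `expected_tool_calls` would need a separate check against the run output, omitted here.

```python
# evals/run.py — a sketch, not the article's actual runner. Assumes Agno's
# AccuracyEval/SqliteDb APIs; get_agent is a hypothetical name->Agent lookup.
from agno.db.sqlite import SqliteDb
from agno.eval.accuracy import AccuracyEval

from evals.cases import CASES
from agents import get_agent  # hypothetical: resolves case.agent to an Agent

# Persisting runs to a db is what surfaces eval history at os.agno.com.
eval_db = SqliteDb(db_file="tmp/evals.db")

for case in CASES:
    AccuracyEval(
        db=eval_db,
        name=case.name,
        agent=get_agent(case.agent),
        input=case.prompt,
        # the criteria string is graded by an LLM judge against the answer
        expected_output=case.criteria,
    ).run(print_results=True)
```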
Diagnose failures with Claude Code
Open Claude Code and paste:
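The snippet to paste isn't reproduced here; one plausible prompt, assuming the case sketched above, is:

```text
The eval case "recent_news_uses_search" is failing: the agent answered
without calling the web search tool. Read evals/cases.py and the agent
definition, re-run the case, and explain why the tool call stopped
happening. Propose a fix but don't apply it yet.
```

When to run evals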
| Trigger | Frequency |
|---|---|
| Before deploying a change to an agent | Every time |
| As part of CI | Every PR |
| Against production | On a weekly cron |
| After bumping a model version | Every time |
What good cases look like
- Specific. “Returns a JSON object with `ticker` and `price`” beats “Returns the right answer”. See the contrast sketched after this list.
- Stable. Avoid prompts whose correct answer changes daily. Use phrasing like “describes a real, recent…” instead of locking in a specific result.
- Scoped to one behavior. One case per behavior makes failures easy to read.
- Anchored to tools. `expected_tool_calls` catches the failure mode where the agent confidently makes things up instead of calling a tool.
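To make the first and last points concrete, here is the same hypothetical case written both ways, reusing the `Case` sketch from earlier; the agent and tool names are assumptions.

```python
# Same hypothetical Case dataclass as sketched earlier.
bad = Case(
    name="stock_lookup",
    agent="finance_agent",                # hypothetical agent name
    prompt="What is Apple trading at?",
    criteria="Returns the right answer",  # vague: the judge has nothing to grade against
)

good = Case(
    name="stock_lookup",
    agent="finance_agent",
    prompt="What is Apple trading at?",
    # specific, scoped to one behavior, and anchored to a tool
    criteria="Returns a JSON object with ticker and price fields",
    expected_tool_calls=["get_stock_price"],  # hypothetical tool name
)
```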