# Evals

> Lock in agent behavior with regression tests that run on a schedule.

Evals are regression tests for your agents: the same prompts run against the same agents on a schedule, and the suite fails when behavior drifts.

## Define cases

Cases live in `evals/cases.py`. Each case is a prompt, the agent that should answer it, and a criterion the answer must satisfy. `Case` is a dataclass defined at the top of the same file, so opening `evals/cases.py` shows you everything in one place.

```python evals/cases.py
CASES: tuple[Case, ...] = (
    Case(
        name="web_search_recent_anthropic_research",
        agent=web_search,
        input="What did Anthropic publish about agent research recently?",
        criteria=(
            "Answers the question by citing at least one real Anthropic URL "
            "(anthropic.com domain). The response is grounded in fetched content "
            "rather than refusing to answer."
        ),
        expected_tool_calls=(_WEB_SEARCH_TOOL,),
    ),
    # add more cases here
)
```

`criteria` is graded by an LLM judge. `expected_tool_calls` checks that the agent actually used the tools you expect. `_WEB_SEARCH_TOOL` switches between `parallel_search` (SDK) and `web_search` (MCP) based on whether `PARALLEL_API_KEY` is set.

## Run the suite

```bash
python -m evals
```

Results are written to Postgres via `eval_db`. Eval history shows up at [os.agno.com](https://os.agno.com) alongside your sessions and traces, so you can see when a case started failing and what changed.

## Diagnose failures with Claude Code

Open Claude Code in the cloned `agent-platform` repo and paste:

```
Run docs/eval-and-improve.md
```

Claude runs the full suite, triages every failure (bad criteria, real regression, flaky LLM judge), and proposes in-scope fixes. It edits the agent or the case, re-runs, and shows you the diff.

## When to run evals

| Trigger                               | Frequency        |
| ------------------------------------- | ---------------- |
| Before deploying a change to an agent | Every time       |
| As part of CI                         | Every PR         |
| Against production                    | On a weekly cron |
| After bumping a model version         | Every time       |

Wire the weekly run into the platform's own scheduler. See [scheduling](/features/scheduling) for the cron API, and the next page for production setup.
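For the pre-deploy and CI rows, one option is a small wrapper that runs the suite and propagates its exit code. This sketch assumes `python -m evals` exits non-zero when a case fails, which this page does not confirm. `run_suite` is a hypothetical helper name.

```python
from __future__ import annotations

import subprocess
import sys
from collections.abc import Sequence


def run_suite(cmd: Sequence[str] | None = None) -> int:
    """Run the eval command and return its exit code.

    Assumption: the suite signals failure through a non-zero exit code.
    """
    if cmd is None:
        # Use the current interpreter so the suite runs in the same environment.
        cmd = [sys.executable, "-m", "evals"]
    return subprocess.run(cmd, check=False).returncode
```

Calling `sys.exit(run_suite())` from a CI step would then fail the job whenever the suite does.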

## What good cases look like

* **Specific.** "Returns a JSON object with `ticker` and `price`" beats "Returns the right answer".
* **Stable.** Avoid prompts whose correct answer changes daily. Phrase criteria like "describes a real, recent..." instead of locking in a specific result.
* **Scoped to one behavior.** One case per behavior makes failures easy to read.
* **Anchored to tools.** `expected_tool_calls` catches the failure mode where the agent confidently makes things up instead of calling a tool.
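The tool-anchoring check in the last bullet amounts to comparing the tool names a case expects with the ones the agent actually called. A minimal sketch, assuming a run exposes the called tool names as strings; `missing_tool_calls` and the `fetch_url` tool name are hypothetical and do not come from the platform:

```python
from __future__ import annotations

from collections.abc import Iterable


def missing_tool_calls(expected: Iterable[str], actual: Iterable[str]) -> tuple[str, ...]:
    """Return the expected tool names that never appear in the actual calls."""
    called = set(actual)
    return tuple(name for name in expected if name not in called)


# An agent that answered from memory made no tool calls, so the case fails.
assert missing_tool_calls(("web_search",), []) == ("web_search",)
# Extra calls are fine as long as every expected tool was used.
assert missing_tool_calls(("web_search",), ["web_search", "fetch_url"]) == ()
```

A check like this is deterministic, so it can fail a case on a fabricated answer even if the LLM judge would have accepted it.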

## Next

[Deploy to Railway →](/tutorials/agent-platform/deploy-to-railway)
