Learn how to evaluate your Agno Agents and Teams across three key dimensions: accuracy (using LLM-as-a-judge), performance (runtime and memory), and reliability (tool calls).
Evals are unit tests for your Agents and Teams; use them judiciously to measure and improve their performance. Agno provides three dimensions for evaluating Agents:
Accuracy evals use input/output pairs to measure the performance of your Agents and Teams against a gold-standard answer. Use a larger model to score the Agent's responses (LLM-as-a-judge).
In this example, the AccuracyEval will run the Agent with the input, then use a larger model (o4-mini) to score the Agent's response according to the guidelines provided.
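A minimal sketch of such an accuracy eval, assuming the AccuracyEval class from agno.eval.accuracy and the OpenAIChat model wrapper (exact module paths and parameter names may differ across Agno versions):

```python
from typing import Optional

from agno.agent import Agent
from agno.eval.accuracy import AccuracyEval, AccuracyResult  # assumed module path
from agno.models.openai import OpenAIChat
from agno.tools.calculator import CalculatorTools

evaluation = AccuracyEval(
    # Larger model used as the judge (LLM-as-a-judge)
    model=OpenAIChat(id="o4-mini"),
    # The Agent under test
    agent=Agent(
        model=OpenAIChat(id="gpt-4o"),
        tools=[CalculatorTools()],
    ),
    input="What is 10*5 then to the power of 2? Do it step by step.",
    expected_output="2500",
    additional_guidelines="The answer should include the intermediate steps.",
)

result: Optional[AccuracyResult] = evaluation.run(print_results=True)
# avg_score is assumed to be on a 1-10 scale
assert result is not None and result.avg_score >= 8
```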
You can also run the AccuracyEval on an existing output (without running the Agent).
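A sketch of scoring a pre-computed answer. The method name run_with_given_answer is an assumption and may differ in your Agno version:

```python
from typing import Optional

from agno.eval.accuracy import AccuracyEval, AccuracyResult  # assumed module path
from agno.models.openai import OpenAIChat

evaluation = AccuracyEval(
    model=OpenAIChat(id="o4-mini"),  # judge model
    input="What is 10*5 then to the power of 2?",
    expected_output="2500",
)

# Score an existing output instead of running the Agent.
# NOTE: the method name is an assumption; check your Agno version's API.
result: Optional[AccuracyResult] = evaluation.run_with_given_answer(
    answer="2500",
    print_results=True,
)
```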
Performance evals measure the latency and memory footprint of an Agent or Team.
While latency will be dominated by the model API's response time, we should still keep performance top of mind and track the Agent or Team's performance with and without certain components. For example, it is useful to know the average latency with and without storage or memory, with a new prompt, or with a new model.
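A minimal sketch of a performance eval, assuming a PerformanceEval class under agno.eval.performance (older releases expose a similar PerfEval, so the import path and parameters such as num_iterations and warmup_runs are assumptions):

```python
from agno.agent import Agent
from agno.eval.performance import PerformanceEval  # assumed module path
from agno.models.openai import OpenAIChat


def simple_response():
    # The function being measured: build and run an Agent end to end.
    agent = Agent(
        model=OpenAIChat(id="gpt-4o-mini"),
        instructions="Be concise, reply with one sentence.",
    )
    return agent.run("What is the capital of France?")


# Measure average runtime and memory footprint over several iterations.
performance_eval = PerformanceEval(
    func=simple_response,
    num_iterations=5,
    warmup_runs=1,
)

if __name__ == "__main__":
    performance_eval.run(print_results=True)
```

Comparing runs of this eval with and without storage, memory, or a different model gives you the latency deltas described above.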
What makes an Agent or Team reliable?
The first check is to ensure the Agent makes the expected tool calls. Here’s an example:
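A minimal sketch of a reliability eval, assuming the ReliabilityEval class from agno.eval.reliability and a calculator toolkit (module paths and field names are assumptions):

```python
from typing import Optional

from agno.agent import Agent
from agno.eval.reliability import ReliabilityEval, ReliabilityResult  # assumed module path
from agno.models.openai import OpenAIChat
from agno.tools.calculator import CalculatorTools

# Run the Agent once and capture its response, including tool calls.
agent = Agent(
    model=OpenAIChat(id="gpt-4o-mini"),
    tools=[CalculatorTools()],
)
response = agent.run("What is 10*5 then to the power of 2? Do it step by step.")

# Check that the expected tools were actually called.
evaluation = ReliabilityEval(
    agent_response=response,
    expected_tool_calls=["multiply", "exponentiate"],
)

result: Optional[ReliabilityResult] = evaluation.run(print_results=True)
if result is not None:
    result.assert_passed()  # raises if an expected tool call is missing
```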
Reliability evals are currently in beta.