Introduction
What are Evals?
Evals are unit tests for your Agents. Use them judiciously to evaluate, measure and improve the performance of your Agents over time.
We typically evaluate Agents on 3 dimensions:
- Accuracy: How complete/correct/accurate is the Agent’s response (LLM-as-a-judge)
- Performance: How fast does the Agent respond and what’s the memory footprint?
- Reliability: Does the Agent make the expected tool calls?
Accuracy
Accuracy evals use input/output pairs to evaluate the Agent’s responses. Another model scores each response against the expected output (LLM-as-a-judge).
Example
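Here is a minimal sketch of an accuracy eval. It assumes two hypothetical callables that are not part of any specific library: `run_agent`, which sends an input to your Agent and returns its response, and `judge_llm`, which sends a prompt to a judge model and returns its reply.

```python
# Hypothetical sketch of an accuracy eval (LLM-as-a-judge).
# run_agent and judge_llm are placeholders for your Agent invocation
# and a call to the judge model.
from dataclasses import dataclass

@dataclass
class AccuracyCase:
    input: str            # prompt sent to the Agent
    expected_output: str  # reference answer the judge compares against

def accuracy_eval(cases, run_agent, judge_llm, threshold=8):
    """Score each Agent response with an LLM judge on a 1-10 scale."""
    results = []
    for case in cases:
        response = run_agent(case.input)
        prompt = (
            "Score the response from 1 (wrong) to 10 (perfect) against the "
            f"expected answer.\nExpected: {case.expected_output}\n"
            f"Response: {response}\nReply with a single integer."
        )
        score = int(judge_llm(prompt))
        results.append((case.input, score, score >= threshold))
    return results

# Usage: accuracy_eval(cases, run_agent=my_agent.run, judge_llm=my_judge)
```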
Performance
Performance evals measure the latency and memory footprint of the Agent operations.
While latency is largely dominated by the model API’s response time, we should still keep performance top of mind and track the Agent’s performance with and without certain components. For example, it is useful to know the average latency with and without storage or memory, with a new prompt, or with a new model.
Example
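Here is a minimal sketch of a performance eval using only the standard library. `run_agent` is again a hypothetical placeholder for your Agent invocation; latency is measured with `time.perf_counter` and peak memory with `tracemalloc`.

```python
# Hypothetical sketch of a performance eval: measures average latency
# and peak memory across several runs of the same prompt.
import time
import tracemalloc
from statistics import mean

def performance_eval(run_agent, prompt, iterations=5):
    """Return average latency (seconds) and average peak memory (bytes)."""
    latencies, peaks = [], []
    for _ in range(iterations):
        tracemalloc.start()
        start = time.perf_counter()
        run_agent(prompt)
        latencies.append(time.perf_counter() - start)
        _, peak = tracemalloc.get_traced_memory()
        peaks.append(peak)
        tracemalloc.stop()
    return {
        "avg_latency_s": mean(latencies),
        "avg_peak_memory_bytes": mean(peaks),
    }

# Usage: run this against differently configured Agents (with/without
# storage, memory, a new prompt, or a new model) and compare the results.
```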
Reliability
What makes an Agent reliable?
- Does the Agent make the expected tool calls?
- Does the Agent handle errors gracefully?
- Does the Agent respect the rate limits of the model API?
Example
The first check is to ensure the Agent makes the expected tool calls.
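Here is a minimal sketch of such a check, assuming a hypothetical `run_agent` callable that returns a result containing the tool calls made during the run; adapt it to however your Agent exposes its tool-call history.

```python
# Hypothetical sketch of a reliability eval: run_agent is assumed to
# return a dict with a "tool_calls" list of {"name": ...} entries.
def reliability_eval(run_agent, prompt, expected_tool_calls):
    """Assert that the Agent made all of the expected tool calls."""
    result = run_agent(prompt)
    made = [call["name"] for call in result["tool_calls"]]
    missing = [tool for tool in expected_tool_calls if tool not in made]
    assert not missing, f"Missing expected tool calls: {missing}"
    return made

# Usage: reliability_eval(my_agent.run, "What is 10!?", ["factorial"])
```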
Reliability evals are currently in beta.