| Example | Description |
|---|---|
| Accuracy | Evaluate how well responses match expected outputs. |
| Agent As Judge | Evaluate output quality with model-based scoring. |
| Performance | Benchmark runtime and memory usage for agents and teams. |
| Reliability | Validate that agents make the expected tool calls. |
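
To make the Reliability row concrete, here is a minimal, framework-agnostic sketch of what validating expected tool calls can look like. It is not the API used by the examples themselves: `ReliabilityResult`, `check_tool_calls`, and the `allow_extra` flag are hypothetical names, and the sketch assumes tool calls have already been reduced to a list of tool names.

```python
from dataclasses import dataclass, field


@dataclass
class ReliabilityResult:
    """Outcome of comparing expected tool calls with the calls actually made."""

    passed: bool
    missing: list[str] = field(default_factory=list)     # expected but never called
    unexpected: list[str] = field(default_factory=list)  # called but not expected


def check_tool_calls(
    expected: list[str],
    actual: list[str],
    allow_extra: bool = True,
) -> ReliabilityResult:
    """Hypothetical helper: pass when every expected tool name was called.

    With allow_extra=False, any call outside the expected set also fails.
    Call order and arguments are deliberately ignored in this sketch.
    """
    actual_set = set(actual)
    expected_set = set(expected)
    missing = [name for name in expected if name not in actual_set]
    unexpected = [name for name in actual if name not in expected_set]
    passed = not missing and (allow_extra or not unexpected)
    return ReliabilityResult(passed=passed, missing=missing, unexpected=unexpected)


if __name__ == "__main__":
    # The agent was expected to call both tools but only called one.
    result = check_tool_calls(["search", "calculator"], ["search"])
    print(result.passed, result.missing)
```

A stricter check would also compare call arguments and ordering; this sketch only covers whether each expected tool was called at all.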