| Example | Description |
| --- | --- |
| Accuracy | Accuracy examples evaluate how well responses match expected outputs. |
| Agent As Judge | Agent-as-judge examples evaluate output quality with model-based scoring. |
| Performance | Performance examples benchmark runtime and memory impact for agents and teams. |
| Reliability | Reliability examples validate whether expected tool calls are made correctly. |
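
To make the accuracy and reliability categories concrete, here is a minimal, library-free sketch of what such checks might compute. `RunRecord`, `accuracy_score`, and `reliability_passed` are hypothetical names invented for illustration and are not part of any real API. The matching rules (case-insensitive exact match, set containment of tool names) are assumptions chosen for simplicity.

```python
from dataclasses import dataclass, field


@dataclass
class RunRecord:
    """Hypothetical record of one agent run (illustrative only, not a library type)."""
    output: str
    tool_calls: list[str] = field(default_factory=list)


def accuracy_score(records, expected_outputs):
    """Return the fraction of runs whose output equals the expected output.

    Assumption: comparison is a case-insensitive exact match after trimming
    whitespace. Real accuracy evals may use looser or model-based matching.
    """
    if not records:
        return 0.0
    matches = sum(
        record.output.strip().lower() == expected.strip().lower()
        for record, expected in zip(records, expected_outputs)
    )
    return matches / len(records)


def reliability_passed(record, expected_tool_calls):
    """Return True if every expected tool name appears in the run's tool calls.

    Assumption: order and call count are ignored; only presence is checked.
    """
    return set(expected_tool_calls) <= set(record.tool_calls)
```

A possible usage: build one `RunRecord` per agent run, pass the list with the expected outputs to `accuracy_score`, and call `reliability_passed` on each record with the tool names that run was expected to invoke.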