Basic Example
In this example, the `AgentAsJudgeEval` evaluates the Agent's output against its input, scoring the response according to the custom criteria provided.
agent_as_judge.py
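A minimal sketch of what `agent_as_judge.py` might look like; the import paths and model id below are assumptions based on the parameters documented further down, so adapt them to your installation:

```python
# Hedged sketch: the import paths and model id are assumptions,
# not confirmed by this page.
from agno.agent import Agent  # assumed import path
from agno.eval.agent_as_judge import AgentAsJudgeEval  # assumed import path
from agno.models.openai import OpenAIChat  # assumed import path

agent = Agent(model=OpenAIChat(id="gpt-5-mini"))
response = agent.run("Explain photosynthesis in two sentences.")

evaluation = AgentAsJudgeEval(
    name="Photosynthesis Answer Quality",
    criteria="The response is accurate, concise, and answers the question directly.",
    scoring_strategy="numeric",  # 1-10 scale; use "binary" for pass/fail
    threshold=7,                 # minimum numeric score to pass
    print_results=True,
)
result = evaluation.run(
    input="Explain photosynthesis in two sentences.",
    output=response.content,
)
```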
Custom Evaluator Agent
You can use a custom agent to evaluate responses with specific instructions:
agent_as_judge_custom_evaluator.py
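A hedged sketch of this pattern, using the `evaluator_agent` parameter from the table below (import paths assumed as in the previous example):

```python
# Hedged sketch: import paths are assumptions.
from agno.agent import Agent  # assumed import path
from agno.eval.agent_as_judge import AgentAsJudgeEval  # assumed import path
from agno.models.openai import OpenAIChat  # assumed import path

# A judge agent with its own instructions.
evaluator = Agent(
    model=OpenAIChat(id="gpt-5-mini"),
    instructions="You are a strict technical reviewer. Penalize vague or unsupported claims.",
)

evaluation = AgentAsJudgeEval(
    criteria="The response is precise and backs its claims with concrete details.",
    evaluator_agent=evaluator,  # custom agent used as the judge
)
result = evaluation.run(
    input="What causes tides?",
    output="Tides are caused mainly by the Moon's gravitational pull on Earth's oceans.",
)
```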

Params
| Parameter | Type | Default | Description |
|---|---|---|---|
| `criteria` | `str` | `""` | The evaluation criteria describing what makes a good response (required). |
| `scoring_strategy` | `Literal["numeric", "binary"]` | `"binary"` | Scoring mode: `"numeric"` (1-10 scale) or `"binary"` (pass/fail). |
| `threshold` | `int` | `7` | Minimum score to pass (only used with the `"numeric"` strategy). |
| `on_fail` | `Optional[Callable]` | `None` | Callback function triggered when an evaluation fails. |
| `additional_guidelines` | `Optional[Union[str, List[str]]]` | `None` | Extra evaluation guidelines beyond the main criteria. |
| `name` | `Optional[str]` | `None` | Name for the evaluation. |
| `model` | `Optional[Model]` | `None` | Model to use for judging (defaults to gpt-5-mini if not provided). |
| `evaluator_agent` | `Optional[Agent]` | `None` | Custom agent to use as the evaluator. |
| `print_summary` | `bool` | `False` | Print a summary of evaluation results. |
| `print_results` | `bool` | `False` | Print detailed evaluation results. |
| `file_path_to_save_results` | `Optional[str]` | `None` | File path where evaluation results are saved. |
| `debug_mode` | `bool` | `False` | Enable debug mode for detailed logging. |
| `db` | `Optional[Union[BaseDb, AsyncBaseDb]]` | `None` | Database in which to store evaluation results. |
| `telemetry` | `bool` | `True` | Enable telemetry. |
| `run_in_background` | `bool` | `False` | Run the evaluation as a background task (non-blocking). |
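For instance, the `on_fail` callback pairs naturally with numeric scoring. The callback's argument below is an assumption, since this page only states that the callback is triggered when evaluation fails:

```python
# Hedged sketch: the callback's argument is an assumption.
def alert_on_failure(result):
    # Triggered when the evaluation fails (e.g. score below threshold).
    print("Evaluation failed:", result)

evaluation = AgentAsJudgeEval(
    criteria="The answer must be complete and free of factual errors.",
    scoring_strategy="numeric",
    threshold=8,
    on_fail=alert_on_failure,
)
```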
Methods
run() / arun()
Run the evaluation synchronously (run()) or asynchronously (arun()).
| Parameter | Type | Default | Description |
|---|---|---|---|
| `input` | `Optional[str]` | `None` | Input text for a single evaluation. |
| `output` | `Optional[str]` | `None` | Output text for a single evaluation. |
| `cases` | `Optional[List[Dict[str, str]]]` | `None` | List of input/output pairs for batch evaluation. |
| `print_summary` | `bool` | `False` | Print a summary of evaluation results. |
| `print_results` | `bool` | `False` | Print detailed evaluation results. |
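For example, a batch run passes a list of input/output pairs via `cases`. A hedged sketch, reusing the `evaluation` object from the earlier example; the `"input"`/`"output"` dict keys are assumed from the `List[Dict[str, str]]` type:

```python
# Hedged sketch: dict keys assumed from the List[Dict[str, str]] type.
import asyncio

evaluation.run(
    cases=[
        {"input": "What is 2 + 2?", "output": "4"},
        {"input": "What is the capital of France?", "output": "Paris"},
    ],
    print_summary=True,
)

# The async variant mirrors run():
asyncio.run(evaluation.arun(input="What is 2 + 2?", output="4"))
```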
Provide either (input, output) for single evaluation OR cases for batch evaluation, not both.

Examples
- Basic Agent as Judge: basic usage with numeric scoring and failure callbacks
- Agent as Judge as Post-Hook: automatic evaluation after agent runs
Developer Resources
- View Cookbook