Agent as Judge evaluations let you define custom quality criteria and use an LLM to score your Agent’s responses. You provide evaluation criteria (like “professional tone”, “factual accuracy”, or “user-friendliness”), and an evaluator model assesses how well the Agent’s output meets those standards.

Basic Example

In this example, the AgentAsJudgeEval evaluates the Agent's output against its input, scoring the response according to the custom criteria provided.
agent_as_judge.py
from agno.agent import Agent
from agno.db.sqlite import SqliteDb
from agno.eval.agent_as_judge import AgentAsJudgeEval
from agno.models.openai import OpenAIChat

# Setup database to persist eval results
db = SqliteDb(db_file="tmp/agent_as_judge_basic.db")

agent = Agent(
    model=OpenAIChat(id="gpt-4o"),
    instructions="You are a technical writer. Explain concepts clearly and concisely.",
    db=db,
)

response = agent.run("Explain what an API is")

evaluation = AgentAsJudgeEval(
    name="Explanation Quality",
    criteria="Explanation should be clear, beginner-friendly, and use simple language",
    scoring_strategy="numeric",  # Score 1-10
    threshold=7,  # Pass if score >= 7
    db=db,
)

result = evaluation.run(
    input="Explain what an API is",
    output=str(response.content),
    print_results=True,
)
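
The same evaluation can run in pass/fail mode by switching the scoring strategy to binary. A minimal sketch, reusing the agent, database, and response from the example above (no threshold is needed, since that only applies to numeric scoring):

# Binary (pass/fail) scoring instead of a 1-10 score
binary_evaluation = AgentAsJudgeEval(
    name="Explanation Quality (Pass/Fail)",
    criteria="Explanation should be clear, beginner-friendly, and use simple language",
    scoring_strategy="binary",
    db=db,
)

binary_result = binary_evaluation.run(
    input="Explain what an API is",
    output=str(response.content),
    print_results=True,
)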

Custom Evaluator Agent

You can supply a custom evaluator agent, with its own model and instructions, to judge responses:
agent_as_judge_custom_evaluator.py
from agno.agent import Agent
from agno.eval.agent_as_judge import AgentAsJudgeEval
from agno.models.openai import OpenAIChat

agent = Agent(
    model=OpenAIChat(id="gpt-4o"),
    instructions="Explain technical concepts simply.",
)

response = agent.run("Explain what an API is")

# Create a custom evaluator with specific instructions
custom_evaluator = Agent(
    model=OpenAIChat(id="gpt-4o"),
    description="Strict technical evaluator",
    instructions="You are a strict evaluator. Only pass exceptionally clear and accurate explanations.",
)

evaluation = AgentAsJudgeEval(
    name="Technical Accuracy",
    criteria="Explanation must be technically accurate and comprehensive",
    evaluator_agent=custom_evaluator,
)

result = evaluation.run(
    input="Explain what an API is",
    output=str(response.content),
    print_results=True,
    print_summary=True,
)
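
Evaluations can also be run asynchronously via arun(), which accepts the same parameters as run() (see Methods below). A minimal sketch, reusing the evaluation and response defined above:

import asyncio

async def main():
    # Same parameters as run(), awaited instead of called synchronously
    result = await evaluation.arun(
        input="Explain what an API is",
        output=str(response.content),
        print_results=True,
    )

asyncio.run(main())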

Params

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| criteria | str | "" | The evaluation criteria describing what makes a good response (required). |
| scoring_strategy | Literal["numeric", "binary"] | "binary" | Scoring mode: "numeric" (1-10 scale) or "binary" (pass/fail). |
| threshold | int | 7 | Minimum score to pass (only used for numeric strategy). |
| on_fail | Optional[Callable] | None | Callback function triggered when evaluation fails. |
| additional_guidelines | Optional[Union[str, List[str]]] | None | Extra evaluation guidelines beyond the main criteria. |
| name | Optional[str] | None | Name for the evaluation. |
| model | Optional[Model] | None | Model to use for judging (defaults to gpt-5-mini if not provided). |
| evaluator_agent | Optional[Agent] | None | Custom agent to use as evaluator. |
| print_summary | bool | False | Print summary of evaluation results. |
| print_results | bool | False | Print detailed evaluation results. |
| file_path_to_save_results | Optional[str] | None | File path to save evaluation results. |
| debug_mode | bool | False | Enable debug mode for detailed logging. |
| db | Optional[Union[BaseDb, AsyncBaseDb]] | None | Database to store evaluation results. |
| telemetry | bool | True | Enable telemetry. |
| run_in_background | bool | False | Run evaluation as background task (non-blocking). |
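
Several of these parameters are commonly combined, for example additional_guidelines to refine the criteria, threshold to tune strictness, and file_path_to_save_results to persist results to disk. A minimal sketch, assuming the imports from the examples above; the guidelines and file path below are illustrative only:

evaluation = AgentAsJudgeEval(
    name="Support Reply Quality",
    criteria="Response should be professional, empathetic, and actionable",
    additional_guidelines=[
        "Avoid jargon unless the user used it first",
        "Always end with a clear next step",
    ],
    scoring_strategy="numeric",
    threshold=8,  # Stricter than the default of 7
    file_path_to_save_results="tmp/support_reply_eval.json",  # Illustrative path
)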

Methods

run() / arun()

Run the evaluation synchronously (run()) or asynchronously (arun()).
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| input | Optional[str] | None | Input text for single evaluation. |
| output | Optional[str] | None | Output text for single evaluation. |
| cases | Optional[List[Dict[str, str]]] | None | List of input/output pairs for batch evaluation. |
| print_summary | bool | False | Print summary of evaluation results. |
| print_results | bool | False | Print detailed evaluation results. |
Provide either input and output for a single evaluation, or cases for batch evaluation, not both.
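
A minimal sketch of batch evaluation with cases, assuming the imports from the examples above and that each case is a dict with "input" and "output" keys (the key names mirror the single-evaluation parameters and are an assumption here):

evaluation = AgentAsJudgeEval(
    name="Clarity",
    criteria="Explanation should be clear and beginner-friendly",
)

result = evaluation.run(
    cases=[
        # Key names assumed to mirror the input/output parameters
        {"input": "Explain what an API is", "output": "An API lets two programs talk to each other."},
        {"input": "Explain what a database is", "output": "A database stores and organizes data."},
    ],
    print_summary=True,
)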

Examples

Developer Resources