A judge is a classifier whose input is a (prompt, response) pair and whose output is a score. Constrain the score with a Literal so it stays on scale.
from typing import Literal

from agno.agent import Agent
from agno.models.openai import OpenAIResponses
from pydantic import BaseModel, Field


class Score(BaseModel):
    overall: Literal[1, 2, 3, 4, 5] = Field(
        ..., description="Overall quality, 5 is excellent"
    )


agent = Agent(
    model=OpenAIResponses(id="gpt-5.5"),
    instructions=(
        "Score the response on overall quality from 1 (unusable) to 5 "
        "(excellent). Use the full scale. Reserve 5 for genuinely "
        "excellent responses."
    ),
    output_schema=Score,
)


def build_input(prompt: str, response: str) -> str:
    return f"Prompt:\n{prompt}\n\nResponse:\n{response}"


prompt = "Explain why the sky is blue, in one sentence."
result = agent.run(build_input(prompt, "It just is.")).content
# e.g. Score(overall=1); the exact score can vary between runs
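
The effect of the Literal constraint can be checked without calling a model. The sketch below assumes pydantic v2 and introduces a hypothetical helper, `parse_score`, that is not part of Agno. It shows an off-scale value being rejected at validation time instead of leaking into downstream metrics.

```python
from typing import Literal, Optional

from pydantic import BaseModel, Field, ValidationError


class Score(BaseModel):
    overall: Literal[1, 2, 3, 4, 5] = Field(
        ..., description="Overall quality, 5 is excellent"
    )


def parse_score(raw: dict) -> Optional[Score]:
    """Hypothetical helper: return a Score, or None if the payload is off scale."""
    try:
        return Score.model_validate(raw)
    except ValidationError:
        # Raised when `overall` is missing or not one of 1..5.
        return None


on_scale = parse_score({"overall": 4})   # accepted
off_scale = parse_score({"overall": 7})  # rejected, so None
```

Returning None is only one way to handle a rejected payload; retrying the judge or logging the raw output are equally reasonable choices.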

Add a rationale

A free-text rationale makes the score auditable and surfaces rubric drift.
from typing import Literal

from pydantic import BaseModel, Field


class Score(BaseModel):
    overall: Literal[1, 2, 3, 4, 5] = Field(..., description="Overall quality")
    rationale: str = Field(..., description="Why this score, citing the response")
Keep the score field before the rationale so the model commits to a number, then explains it.
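
Field order is visible in the schema generated from the model class. As a quick check, assuming pydantic v2, the JSON schema lists properties in declaration order, so `overall` comes before `rationale`. Whether a given model provider emits fields in exactly that order is an assumption worth verifying against real outputs.

```python
from typing import Literal

from pydantic import BaseModel, Field


class Score(BaseModel):
    overall: Literal[1, 2, 3, 4, 5] = Field(..., description="Overall quality")
    rationale: str = Field(..., description="Why this score, citing the response")


# Properties appear in the order the fields are declared on the class.
schema = Score.model_json_schema()
field_order = list(schema["properties"])
print(field_order)  # ['overall', 'rationale']
```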

Multi-dimension rubric

Score each dimension independently, then give an overall score. Separate fields keep one weak dimension from dragging down the others.
from typing import Literal

from pydantic import BaseModel, Field

Rating = Literal[1, 2, 3, 4, 5]


class RubricScore(BaseModel):
    correctness: Rating = Field(..., description="Factually correct")
    completeness: Rating = Field(..., description="Covers what was asked")
    clarity: Rating = Field(..., description="Easy to follow")
    concision: Rating = Field(..., description="No padding")
    overall: Rating = Field(..., description="Holistic quality")
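
Because each dimension is its own field, a result can be inspected per criterion once the judge returns. The sketch below assumes pydantic v2 and adds two hypothetical helpers, `dimension_mean` and `weakest_dimensions`, which are not part of Agno. It builds a `RubricScore` by hand (no model call) to show how the holistic `overall` can be compared with the plain average of the dimensions.

```python
from statistics import mean
from typing import Literal

from pydantic import BaseModel, Field

Rating = Literal[1, 2, 3, 4, 5]

# Every field except the holistic `overall`.
DIMENSIONS = ("correctness", "completeness", "clarity", "concision")


class RubricScore(BaseModel):
    correctness: Rating = Field(..., description="Factually correct")
    completeness: Rating = Field(..., description="Covers what was asked")
    clarity: Rating = Field(..., description="Easy to follow")
    concision: Rating = Field(..., description="No padding")
    overall: Rating = Field(..., description="Holistic quality")


def dimension_mean(score: RubricScore) -> float:
    """Hypothetical helper: unweighted average of the four dimensions."""
    return mean(getattr(score, name) for name in DIMENSIONS)


def weakest_dimensions(score: RubricScore) -> list[str]:
    """Hypothetical helper: names of the lowest-scoring dimensions."""
    lowest = min(getattr(score, name) for name in DIMENSIONS)
    return [name for name in DIMENSIONS if getattr(score, name) == lowest]


# Hand-built example, not real judge output.
example = RubricScore(
    correctness=5, completeness=2, clarity=4, concision=5, overall=3
)
print(weakest_dimensions(example))  # ['completeness']
print(dimension_mean(example))      # 4
```

A large gap between `overall` and the dimension average, as in this hand-built example, is one signal that the judge is weighting a dimension heavily.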

Picking the shape

| You need | Schema |
| --- | --- |
| One quality number | Literal[1, 2, 3, 4, 5] |
| Number plus justification | Add a rationale field after the score |
| Per-criterion breakdown | One Literal field per dimension |
| A vs B instead of a score | Preference data |

Relationship to evals

This is the same primitive as single-label classification, pointed at model outputs instead of raw data. When the judge is the deliverable, it lives here. When it scores a system under test, see Evals.

Next steps

| Task | Guide |
| --- | --- |
| Rank two responses | Preference data |
| Reduce single-model bias | Quality pipeline |