A judge is a classifier whose input is a (prompt, response) pair and whose output is a score. Constrain the score with a Literal so it stays on scale.
```python
from typing import Literal

from agno.agent import Agent
from agno.models.openai import OpenAIResponses
from pydantic import BaseModel, Field


class Score(BaseModel):
    # Literal restricts the field to the five points on the scale.
    overall: Literal[1, 2, 3, 4, 5] = Field(
        ..., description="Overall quality, 5 is excellent"
    )


agent = Agent(
    model=OpenAIResponses(id="gpt-5.5"),
    instructions=(
        "Score the response on overall quality from 1 (unusable) to 5 "
        "(excellent). Use the full scale. Reserve 5 for genuinely "
        "excellent responses."
    ),
    output_schema=Score,
)


def build_input(prompt: str, response: str) -> str:
    # Present the pair under clear labels so the judge can tell them apart.
    return f"Prompt:\n{prompt}\n\nResponse:\n{response}"


prompt = "Explain why the sky is blue, in one sentence."
result = agent.run(build_input(prompt, "It just is.")).content
# e.g. Score(overall=1)
```
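To see what the `Literal` constraint buys, the schema can be exercised without calling a model. The following is a minimal sketch, assuming Pydantic v2 (`model_validate`); the `is_on_scale` helper is hypothetical and only illustrates that off-scale values fail validation.

```python
from typing import Literal

from pydantic import BaseModel, Field, ValidationError


class Score(BaseModel):
    overall: Literal[1, 2, 3, 4, 5] = Field(
        ..., description="Overall quality, 5 is excellent"
    )


def is_on_scale(payload: dict) -> bool:
    """Hypothetical helper: True if the payload validates against Score."""
    try:
        Score.model_validate(payload)
        return True
    except ValidationError:
        return False


on_scale = is_on_scale({"overall": 4})   # inside the 1-5 scale
too_high = is_on_scale({"overall": 7})   # above the scale
too_low = is_on_scale({"overall": 0})    # below the scale
missing = is_on_scale({})                # no score at all
print(on_scale, too_high, too_low, missing)
```

Only the first payload validates; the other three are rejected before they can reach downstream code.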
Add a rationale
A free-text rationale makes the score auditable and surfaces rubric drift.
```python
from typing import Literal

from pydantic import BaseModel, Field


class Score(BaseModel):
    overall: Literal[1, 2, 3, 4, 5] = Field(..., description="Overall quality")
    rationale: str = Field(..., description="Why this score, citing the response")
```
Keep the score field before the rationale so the model commits to a number, then explains it.
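Field order only matters if it survives into the schema the model sees. As a quick check, assuming Pydantic v2 (`model_json_schema`), the sketch below lists the schema properties in the order they are emitted. Whether a given model provider generates fields in that order is an assumption worth verifying separately.

```python
from typing import Literal

from pydantic import BaseModel, Field


class Score(BaseModel):
    overall: Literal[1, 2, 3, 4, 5] = Field(..., description="Overall quality")
    rationale: str = Field(..., description="Why this score, citing the response")


# Property names in the order they appear in the generated JSON schema.
field_order = list(Score.model_json_schema()["properties"])
print(field_order)
```

The properties follow declaration order, so `overall` comes before `rationale`.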
Multi-dimension rubric
Score each dimension independently, then give an overall score. Separate fields stop one weak dimension from dragging the others down.
```python
from typing import Literal

from pydantic import BaseModel, Field

Rating = Literal[1, 2, 3, 4, 5]


class RubricScore(BaseModel):
    correctness: Rating = Field(..., description="Factually correct")
    completeness: Rating = Field(..., description="Covers what was asked")
    clarity: Rating = Field(..., description="Easy to follow")
    concision: Rating = Field(..., description="No padding")
    overall: Rating = Field(..., description="Holistic quality")
```
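One possible way to audit a multi-dimension score is to compare the holistic `overall` with the mean of the individual dimensions and flag large gaps for review. This is a sketch, not part of the library: the `overall_gap` helper, the sample values, and the review threshold of 1.0 are all hypothetical.

```python
from statistics import mean
from typing import Literal

from pydantic import BaseModel, Field

Rating = Literal[1, 2, 3, 4, 5]


class RubricScore(BaseModel):
    correctness: Rating = Field(..., description="Factually correct")
    completeness: Rating = Field(..., description="Covers what was asked")
    clarity: Rating = Field(..., description="Easy to follow")
    concision: Rating = Field(..., description="No padding")
    overall: Rating = Field(..., description="Holistic quality")


DIMENSIONS = ("correctness", "completeness", "clarity", "concision")


def overall_gap(score: RubricScore) -> float:
    """Hypothetical helper: absolute gap between overall and the dimension mean."""
    dims = [getattr(score, name) for name in DIMENSIONS]
    return abs(score.overall - mean(dims))


# Made-up sample: weak correctness, yet a top overall score.
sample = RubricScore(
    correctness=2, completeness=4, clarity=5, concision=5, overall=5
)
gap = overall_gap(sample)   # dimension mean is 4, so the gap is 1
needs_review = gap >= 1.0   # hypothetical threshold
print(gap, needs_review)
```

A gap is not necessarily an error, since `overall` is holistic rather than an average, but a consistently large gap can be a sign that the judge is ignoring one dimension.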
Picking the shape
| You need | Schema |
|---|---|
| One quality number | `Literal[1, 2, 3, 4, 5]` |
| Number plus justification | Add a rationale field after the score |
| Per-criterion breakdown | One `Literal` field per dimension |
| A vs B instead of a score | Preference data |
Relationship to evals
This is the same primitive as single-label classification, pointed at model outputs instead of raw data. When the judge is the deliverable, it lives here. When it scores a system under test, see Evals.
Next steps
| Task | Guide |
|---|---|
| Rank two responses | Preference data |
| Reduce single-model bias | Quality pipeline |