This example demonstrates how to use Agent as Judge evaluation to assess the main agent’s output as a background task. Unlike blocking validation, background evaluation:
  • Does NOT block the response to the user
  • Logs evaluation results for monitoring and analytics
  • Can trigger alerts or store metrics without affecting latency
Use cases:
  • Quality monitoring in production
  • Compliance auditing
  • Detecting hallucinations or other inappropriate content
1. Create a Python file

background_output_evaluation.py
from agno.agent import Agent
from agno.db.sqlite import AsyncSqliteDb
from agno.eval.agent_as_judge import AgentAsJudgeEval
from agno.models.openai import OpenAIResponses
from agno.os import AgentOS

# Setup database for agent and evaluation storage
db = AsyncSqliteDb(db_file="tmp/evaluation.db")

# Create the evaluator using Agent as Judge
evaluator = AgentAsJudgeEval(
    db=db,
    name="Response Quality Check",
    model=OpenAIResponses(id="gpt-5.2"),
    criteria="Response should be helpful, accurate, and well-structured",
    additional_guidelines=[
        "Evaluate if the response addresses the user's question directly",
        "Check if the information provided is correct and reliable",
        "Assess if the response is well-organized and easy to understand",
    ],
    threshold=7,
    run_in_background=True,  # Runs evaluation without blocking the response
)

# Create the main agent with Agent as Judge evaluation
main_agent = Agent(
    id="support-agent",
    name="CustomerSupportAgent",
    model=OpenAIResponses(id="gpt-5.2"),
    instructions=[
        "You are a helpful customer support agent.",
        "Provide clear, accurate, and friendly responses.",
        "If you don't know something, say so honestly.",
    ],
    db=db,
    post_hooks=[evaluator],  # Automatically evaluates each response
    markdown=True,
)

# Create AgentOS
agent_os = AgentOS(agents=[main_agent])
app = agent_os.get_app()


if __name__ == "__main__":
    agent_os.serve(app="background_output_evaluation:app", port=7777, reload=True)
2. Set up your virtual environment

uv venv --python 3.12
source .venv/bin/activate
3. Install dependencies

uv pip install -U agno openai uvicorn
4. Export your OpenAI API key

export OPENAI_API_KEY="your_openai_api_key_here"
5. Run the server

python background_output_evaluation.py
6. Test the endpoint

curl -X POST http://localhost:7777/agents/support-agent/runs \
  -F "message=How do I reset my password?" \
  -F "stream=false"
The response will be returned immediately. The evaluation runs in the background and results are stored in the database.
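If you prefer to script the request, the call below is equivalent to the curl command above, using Python's httpx (assuming httpx is installed; it is not part of the install step above). The URL and form fields mirror the curl example.

import httpx

# Post the message as form data to the support-agent run endpoint and print
# the reply. The response returns as soon as the agent finishes; the
# evaluation continues in the background.
response = httpx.post(
    "http://localhost:7777/agents/support-agent/runs",
    data={"message": "How do I reset my password?", "stream": "false"},
    timeout=60,
)
print(response.status_code)
print(response.text)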

What Happens

  1. User sends a request to the agent
  2. The agent processes and generates a response
  3. The response is sent to the user immediately
  4. Background evaluation runs:
    • AgentAsJudgeEval automatically evaluates the response against the criteria
    • Scores the response on a scale of 1-10
    • Stores results in the database
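To confirm that results were persisted, you can inspect the SQLite file directly with the standard library. The sketch below does not assume agno's table names; it simply lists every table in tmp/evaluation.db and its row count.

import sqlite3

# Open the evaluation database and report how many rows each table holds.
conn = sqlite3.connect("tmp/evaluation.db")
tables = conn.execute("SELECT name FROM sqlite_master WHERE type='table'")
for (table,) in tables.fetchall():
    count = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
    print(f"{table}: {count} rows")
conn.close()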

Production Extensions

In production, you could extend this pattern to:
  • Database Storage: Store evaluations for analytics dashboards
  • Alerting: Use the on_fail callback to send alerts when evaluations fail (see the sketch after this list)
  • Observability: Log to platforms like Datadog or OpenTelemetry
  • A/B Testing: Compare response quality across model versions
  • Training Data: Build datasets for fine-tuning
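As one illustration of the alerting extension, the sketch below wires a callback into the evaluator. The on_fail hook is named above, but its exact signature and the shape of the value it receives are assumptions here, not the library's documented API; the snippet also reuses the db and model objects from the script above.

# Hypothetical alerting callback. The argument it receives is an assumption;
# check what AgentAsJudgeEval actually passes to on_fail before relying on it.
def alert_on_low_score(result) -> None:
    print(f"ALERT: response scored below threshold: {result}")
    # In production, forward this to Slack, PagerDuty, or your logging pipeline.

evaluator_with_alerts = AgentAsJudgeEval(
    db=db,  # reuses the db from the script above
    name="Response Quality Check (alerting)",
    model=OpenAIResponses(id="gpt-5.2"),
    criteria="Response should be helpful, accurate, and well-structured",
    threshold=7,
    run_in_background=True,
    on_fail=alert_on_low_score,  # assumed keyword argument; see note above
)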
Background evaluation is ideal for quality monitoring without impacting user experience. For scenarios where you need to block bad responses, use synchronous hooks instead.
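If you do need to block, the simplest change in this example is to flip the run_in_background flag. This is a sketch: it assumes the evaluator holds the response until the verdict is in when the flag is False, mirroring the comment in the script above.

blocking_evaluator = AgentAsJudgeEval(
    db=db,
    name="Response Quality Check",
    model=OpenAIResponses(id="gpt-5.2"),
    criteria="Response should be helpful, accurate, and well-structured",
    threshold=7,
    run_in_background=False,  # assumed: the agent waits for the verdict before replying
)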

Related Examples

  • Global Background Hooks: Run all hooks as background tasks
  • Per-Hook Background: Mix synchronous and background hooks