This example demonstrates how to use a validator agent to evaluate the main agent’s output as a background task. Unlike blocking validation, background evaluation:
  • Does NOT block the response to the user
  • Logs evaluation results for monitoring and analytics
  • Can trigger alerts or store metrics without affecting latency
Use cases:
  • Quality monitoring in production
  • Compliance auditing
  • Detecting hallucinations and other problematic content
1. Create a Python file

touch background_output_evaluation.py
2. Add the following code to your Python file

background_output_evaluation.py
from datetime import datetime

from agno.agent import Agent
from agno.db.sqlite import AsyncSqliteDb
from agno.hooks import hook
from agno.models.openai import OpenAIChat
from agno.os import AgentOS
from agno.run.agent import RunOutput
from pydantic import BaseModel


class EvaluationResult(BaseModel):
    """Structured output for the evaluator agent."""

    is_helpful: bool
    is_accurate: bool
    is_well_structured: bool
    quality_score: float  # 0.0 - 1.0
    strengths: list[str]
    areas_for_improvement: list[str]
    summary: str


# Create the evaluator agent once (not in the hook for performance)
evaluator_agent = Agent(
    name="OutputEvaluator",
    model=OpenAIChat(id="gpt-4o-mini"),
    instructions=[
        "You are an expert at evaluating AI assistant responses.",
        "Analyze responses for:",
        "1. HELPFULNESS: Does it address the user's question?",
        "2. ACCURACY: Is the information correct and reliable?",
        "3. STRUCTURE: Is it well-organized and easy to understand?",
        "",
        "Provide a quality_score from 0.0 to 1.0 where:",
        "- 0.0-0.3: Poor quality, major issues",
        "- 0.4-0.6: Acceptable, some improvements needed",
        "- 0.7-0.8: Good quality, minor issues",
        "- 0.9-1.0: Excellent quality",
        "",
        "Be fair and balanced in your evaluation.",
    ],
    output_schema=EvaluationResult,
)


@hook(run_in_background=True)
async def evaluate_output_quality(run_output: RunOutput, agent: Agent) -> None:
    """
    Background post-hook that evaluates the agent's response quality.

    This runs after the response is sent to the user, so it doesn't add latency.
    Results are logged for monitoring purposes.
    """
    # Skip if no content to evaluate
    if not run_output.content or len(str(run_output.content).strip()) < 10:
        print("[Evaluator] Skipping evaluation - response too short")
        return

    print(f"[Evaluator] Starting background evaluation for run: {run_output.run_id}")

    # Run the evaluation
    evaluation_prompt = f"""
    Evaluate this AI assistant response:

    User Query: {run_output.input_content if hasattr(run_output, "input_content") else "Unknown"}

    Assistant Response:
    {run_output.content}
    """

    result = await evaluator_agent.arun(input=evaluation_prompt)
    evaluation: EvaluationResult = result.content

    # Log the evaluation results
    timestamp = datetime.now().isoformat()
    print("\n" + "=" * 60)
    print(f"[Evaluator] Evaluation Complete - {timestamp}")
    print("=" * 60)
    print(f"Run ID: {run_output.run_id}")
    print(f"Agent: {agent.name}")
    print(f"\nQuality Score: {evaluation.quality_score:.2f}/1.00")
    print(f"Helpful: {evaluation.is_helpful}")
    print(f"Accurate: {evaluation.is_accurate}")
    print(f"Well-Structured: {evaluation.is_well_structured}")

    if evaluation.strengths:
        print("\nStrengths:")
        for strength in evaluation.strengths:
            print(f"  - {strength}")

    if evaluation.areas_for_improvement:
        print("\nAreas for Improvement:")
        for area in evaluation.areas_for_improvement:
            print(f"  - {area}")

    print(f"\nSummary: {evaluation.summary}")
    print("=" * 60 + "\n")

    # In production, you could:
    # - Store in database for analytics
    # - Send alerts if quality_score < threshold
    # - Log to observability platform
    # - Build evaluation datasets


# Set up the database for agent storage
db = AsyncSqliteDb(db_file="tmp/evaluation.db")

# Create the main agent with background evaluation hooks
main_agent = Agent(
    id="support-agent",
    name="CustomerSupportAgent",
    model=OpenAIChat(id="gpt-4o-mini"),
    instructions=[
        "You are a helpful customer support agent.",
        "Provide clear, accurate, and friendly responses.",
        "If you don't know something, say so honestly.",
    ],
    db=db,
    post_hooks=[
        evaluate_output_quality,  # Runs in background
    ],
    markdown=True,
)

# Create AgentOS
agent_os = AgentOS(agents=[main_agent])
app = agent_os.get_app()


if __name__ == "__main__":
    agent_os.serve(app="background_output_evaluation:app", port=7777, reload=True)
3. Create a virtual environment

Open the terminal and create a Python virtual environment.
python3 -m venv .venv
source .venv/bin/activate
4. Install libraries

pip install -U agno openai uvicorn
5. Export your OpenAI API key

export OPENAI_API_KEY="your_openai_api_key_here"
6. Run the server

python background_output_evaluation.py
7. Test the endpoint

curl -X POST http://localhost:7777/agents/support-agent/runs \
  -F "message=How do I reset my password?" \
  -F "stream=false"
The response will be returned immediately. Check the server logs to see the background evaluation results after the response is sent.
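
If you prefer to test from Python instead of curl, a minimal sketch using the requests package (an extra dependency not installed in step 4) that sends the same multipart form fields:

import requests  # assumed extra dependency: pip install requests

# (None, value) tuples make requests send these as multipart form fields,
# matching the curl -F flags above.
response = requests.post(
    "http://localhost:7777/agents/support-agent/runs",
    files={
        "message": (None, "How do I reset my password?"),
        "stream": (None, "false"),
    },
)
print(response.status_code)
print(response.json())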

What Happens

  1. User sends a request to the agent
  2. The agent processes and generates a response
  3. The response is sent to the user immediately
  4. Background hook runs:
    • evaluate_output_quality: Evaluator agent scores the response
  5. Evaluation results are logged for monitoring

Production Extensions

In production, you could extend this pattern to:
  • Database Storage: Store evaluations for analytics dashboards
  • Alerting: Send alerts when quality_score < 0.5
  • Observability: Log to platforms like Datadog or OpenTelemetry
  • A/B Testing: Compare response quality across model versions
  • Training Data: Build datasets for fine-tuning
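
For example, the hook above could hand its results to a small persistence-and-alerting helper added to the same file. The sketch below is illustrative rather than part of the example: the table name, threshold, and alert sink are all assumptions, and it uses the standard-library sqlite3 module instead of Agno's database classes.

import sqlite3
from datetime import datetime

ALERT_THRESHOLD = 0.5  # hypothetical cut-off for alerting


def record_evaluation(run_id: str, agent_name: str, evaluation: EvaluationResult) -> None:
    """Store one evaluation row and print a simple alert when quality is low."""
    conn = sqlite3.connect("tmp/evaluation_metrics.db")  # hypothetical metrics store
    conn.execute(
        "CREATE TABLE IF NOT EXISTS evaluations ("
        "run_id TEXT, agent TEXT, quality_score REAL, payload TEXT, created_at TEXT)"
    )
    conn.execute(
        "INSERT INTO evaluations VALUES (?, ?, ?, ?, ?)",
        (
            run_id,
            agent_name,
            evaluation.quality_score,
            evaluation.model_dump_json(),
            datetime.now().isoformat(),
        ),
    )
    conn.commit()
    conn.close()

    if evaluation.quality_score < ALERT_THRESHOLD:
        # Replace this print with your alerting channel (Slack, PagerDuty, etc.).
        print(f"[ALERT] Low quality score {evaluation.quality_score:.2f} for run {run_id}")

Inside evaluate_output_quality you would call record_evaluation(run_output.run_id, agent.name, evaluation) after the evaluation completes; because the hook already runs in the background, the extra I/O still does not add user-facing latency.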
Background evaluation is ideal for quality monitoring without impacting user experience. For scenarios where you need to block bad responses, use synchronous hooks instead.
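
As a rough illustration of the blocking alternative, the sketch below registers a validator as a synchronous post-hook. It assumes that run_in_background=False (the presumed default) makes the hook run before the response is returned and that raising an exception is how a response gets rejected; check the Agno hooks documentation for the exact blocking mechanism your version supports.

from agno.agent import Agent
from agno.hooks import hook
from agno.run.agent import RunOutput


@hook(run_in_background=False)  # assumption: False means the hook blocks the response
async def reject_empty_responses(run_output: RunOutput, agent: Agent) -> None:
    """Synchronous post-hook sketch: runs before the response reaches the user."""
    content = str(run_output.content or "").strip()
    if not content:
        # Assumption: raising here rejects the response instead of just logging it.
        raise ValueError(f"Empty response blocked for agent {agent.name}")

Register it the same way as the background hook, via post_hooks=[reject_empty_responses] on the agent.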