Quality pipeline

A single model is a single point of failure. Run two labelers from different providers, diff their outputs, and adjudicate only where they disagree. The disagreement record doubles as the audit trail.

from typing import List, Optional

from agno.agent import Agent
from agno.models.anthropic import Claude
from agno.models.google import Gemini
from pydantic import BaseModel, Field


class Contact(BaseModel):
    name: Optional[str] = None
    email: Optional[str] = None
    company: Optional[str] = None


class FieldDisagreement(BaseModel):
    field: str = Field(..., description="Top-level Contact field name")
    value_a: Optional[str] = None
    value_b: Optional[str] = None
    reason: str = Field(..., description="Why this field needs adjudication")


class DisagreementReport(BaseModel):
    disagreements: List[FieldDisagreement] = Field(default_factory=list)
    needs_adjudication: bool = Field(..., description="True if any field disagrees")


class FinalLabel(BaseModel):
    contact: Contact
    notes: Optional[str] = None


LABELER = "Extract contact info. Use exactly what the text shows. Null if missing."

labeler_a = Agent(model=Gemini(id="gemini-3.5-flash"), instructions=LABELER, output_schema=Contact)
labeler_b = Agent(model=Claude(id="claude-opus-4-7"), instructions=LABELER, output_schema=Contact)

reviewer = Agent(
    model=Claude(id="claude-opus-4-7"),
    instructions=(
        "Compare two labelers' Contact outputs field by field. A field "
        "needs adjudication when both are non-null but differ. Set "
        "needs_adjudication=true if any field does."
    ),
    output_schema=DisagreementReport,
)

adjudicator = Agent(
    model=Claude(id="claude-opus-4-7"),
    instructions=(
        "Re-read the original text and resolve every reported "
        "disagreement. Return a FinalLabel with the correct values."
    ),
    output_schema=FinalLabel,
)


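def label_with_quality_review(text: str) -> FinalLabel:
    """Label one document with two models; adjudicate only on disagreement."""
    # Label the same text independently with each provider.
    a = labeler_a.run(text).content
    b = labeler_b.run(text).content

    # Ask the reviewer for a field-by-field diff of the two labels.
    report = reviewer.run(
        f"Labeler A:\n{a.model_dump_json()}\n\nLabeler B:\n{b.model_dump_json()}"
    ).content

    # Agreement short-circuits the expensive adjudication step.
    if not report.needs_adjudication:
        return FinalLabel(contact=a, notes="Labelers agreed.")

    # On disagreement, the adjudicator sees the original text, both labels,
    # and the reviewer's report.
    return adjudicator.run(
        f"Original input:\n{text}\n\n"
        f"Labeler A:\n{a.model_dump_json()}\n\n"
        f"Labeler B:\n{b.model_dump_json()}\n\n"
        f"Reviewer report:\n{report.model_dump_json()}"
    ).content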

The flow

Two labelers, two providers. Provider diversity is what makes the disagreement informative. When the Gemini and Claude labelers disagree, that is usually where the document is ambiguous.
Reviewer diffs them. It emits one FieldDisagreement per conflicting field and a single needs_adjudication flag. Agreement short-circuits the expensive step.
Adjudicator runs only on disagreement. It re-reads the original input with both labels and the reviewer’s report, then returns the final record.
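The reviewer's rule (a field needs adjudication when both values are non-null but differ) is simple enough to check deterministically, which can serve as a cross-check on the LLM reviewer. The sketch below is not part of the pipeline above; `diff_labels` is a hypothetical helper that assumes exact string comparison and takes the plain dicts that `model_dump()` would return for each `Contact`.

```python
from typing import Any, Dict, List, Optional

Label = Dict[str, Optional[str]]


def diff_labels(a: Label, b: Label) -> List[Dict[str, Any]]:
    """Return one record per field where both values are non-null but differ."""
    conflicts = []
    for field in sorted(set(a) | set(b)):
        value_a, value_b = a.get(field), b.get(field)
        # A null on either side is not a conflict under the reviewer's rule.
        if value_a is not None and value_b is not None and value_a != value_b:
            conflicts.append({"field": field, "value_a": value_a, "value_b": value_b})
    return conflicts


a = {"name": "Ada Lovelace", "email": "ada@example.com", "company": None}
b = {"name": "Ada Lovelace", "email": "ada@example.org", "company": "Analytical Engines"}

conflicts = diff_labels(a, b)
needs_adjudication = bool(conflicts)
print(conflicts)
```

Under this rule a field that only one labeler filled in (`company` here) is not flagged, so it is worth deciding explicitly whether such fields should be merged or sent to adjudication as well.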

The DisagreementReport is worth persisting. Disagreement rate by field, by vendor, and over time tells you where the schema or the prompt is weak.
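As a rough illustration of the per-field rate, the sketch below aggregates persisted reports. It assumes each report was stored as the dict produced by `model_dump()`; `disagreement_rate_by_field` is a hypothetical helper, and the sample data is made up for the example.

```python
from collections import Counter
from typing import Dict, Iterable


def disagreement_rate_by_field(reports: Iterable[dict]) -> Dict[str, float]:
    """Fraction of reports in which each field appears as a disagreement."""
    counts: Counter = Counter()
    total = 0
    for report in reports:
        total += 1
        # Count each field at most once per report.
        fields = {d["field"] for d in report.get("disagreements", [])}
        counts.update(fields)
    if total == 0:
        return {}
    return {field: n / total for field, n in counts.items()}


# Illustrative data only: four stored reports, two of them with conflicts.
reports = [
    {"disagreements": [{"field": "email"}], "needs_adjudication": True},
    {"disagreements": [], "needs_adjudication": False},
    {"disagreements": [{"field": "email"}, {"field": "company"}], "needs_adjudication": True},
    {"disagreements": [], "needs_adjudication": False},
]

rates = disagreement_rate_by_field(reports)
print(rates)
```

Grouping the same counts by provider pair or by day gives the other two views mentioned above.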

Production composition

The example above makes its calls sequentially (two labelers, the reviewer, then the adjudicator on disagreement) so the pattern is readable. For a million-document job, wrap the labelers in a Parallel step and gate the adjudicator behind a Condition in a Workflow. See parallel workflows and conditional workflows.
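The shape of that composition can be sketched with plain `asyncio`. This is not Agno's Workflow API: the three coroutines are hypothetical stand-ins for the labelers and the adjudicator, and the conflict check is a simple deterministic one. The point is only the control flow: fan out, then adjudicate conditionally.

```python
import asyncio
from typing import Dict, Optional

Label = Dict[str, Optional[str]]


async def label_a(text: str) -> Label:
    # Hypothetical stand-in for the first labeler's async call.
    await asyncio.sleep(0)
    return {"name": "Ada", "email": "ada@example.com"}


async def label_b(text: str) -> Label:
    # Hypothetical stand-in for the second labeler's async call.
    await asyncio.sleep(0)
    return {"name": "Ada", "email": "ada@example.org"}


async def adjudicate(text: str, a: Label, b: Label) -> Label:
    # Stand-in only: a real adjudicator would re-read the text. This one prefers B.
    await asyncio.sleep(0)
    return dict(b)


def has_conflict(a: Label, b: Label) -> bool:
    """True if any field is non-null in both labels and the values differ."""
    return any(
        a.get(k) is not None and b.get(k) is not None and a.get(k) != b.get(k)
        for k in set(a) | set(b)
    )


async def label_one(text: str) -> Label:
    # "Parallel" step: both labelers run concurrently.
    a, b = await asyncio.gather(label_a(text), label_b(text))
    # "Condition" step: the adjudicator runs only when the labels conflict.
    if not has_conflict(a, b):
        return a
    return await adjudicate(text, a, b)


result = asyncio.run(label_one("Ada <ada@example.org>"))
print(result)
```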

Production checklist

Agno gives you the orchestration primitives. These concerns are yours to add, and the cookbook is explicit about each.

Concern	What to add
Rate limiting	Wrap `agent.arun` with a per-provider limiter, or front it with a gateway. Agno does not throttle outbound calls.
Bounded concurrency	An `asyncio.Semaphore` around the batch fan-out.
Dead-letter queue	Record failed item IDs and re-run them through a stricter pass.
Idempotency	A deterministic `session_id` per item, so re-runs upsert instead of creating duplicates.
Provider Batch APIs	For non-urgent jobs, call the provider batch endpoints directly for the discount. Not wrapped by Agno.
Prompt versioning	Track a `prompt_version` in run metadata so historical labels stay joinable.
Authoritative cost	`RunMetrics.cost` is populated only when the provider returns it. Attach a token-rate table downstream if you need exact numbers.
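Three of these rows (bounded concurrency, dead-letter queue, idempotency) fit in one small sketch. Everything here is illustrative: `label_item` is a hypothetical stand-in for the real labeling call, `session_id_for` is one possible way to derive a deterministic ID, and an in-memory dict and list stand in for a real store and queue.

```python
import asyncio
import hashlib
from typing import Dict, List, Tuple


def session_id_for(item_id: str, prompt_version: str) -> str:
    """Same item and prompt version always map to the same ID."""
    digest = hashlib.sha256(f"{prompt_version}:{item_id}".encode()).hexdigest()
    return digest[:16]


async def label_item(item_id: str, text: str) -> str:
    # Hypothetical stand-in for the real labeling call; fails on empty input.
    await asyncio.sleep(0)
    if not text:
        raise ValueError("empty document")
    return text.upper()


async def run_batch(
    items: Dict[str, str], max_concurrency: int = 4
) -> Tuple[Dict[str, str], List[str]]:
    sem = asyncio.Semaphore(max_concurrency)  # bounds in-flight calls
    results: Dict[str, str] = {}              # keyed by session ID, so re-runs overwrite
    dead_letters: List[str] = []              # item IDs to retry in a stricter pass

    async def worker(item_id: str, text: str) -> None:
        async with sem:
            try:
                label = await label_item(item_id, text)
            except Exception:
                dead_letters.append(item_id)
                return
            results[session_id_for(item_id, "v1")] = label

    await asyncio.gather(*(worker(i, t) for i, t in items.items()))
    return results, dead_letters


results, dead_letters = asyncio.run(
    run_batch({"doc-1": "hello", "doc-2": "", "doc-3": "world"})
)
print(len(results), dead_letters)
```

Including the prompt version in the ID is a design choice: re-running with a new prompt produces new records instead of silently overwriting labels made under the old one.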

Next steps

Task	Guide
Build the labelers	Structured extraction
Compose as a workflow	Workflows
Run agents concurrently	Async execution
