Every labeler on the other pages takes text. To label other modalities, change the input argument and the model. The schema and the output_schema pattern stay the same.
from typing import Literal

from agno.agent import Agent
from agno.media import Image
from agno.models.openai import OpenAIResponses
from pydantic import BaseModel, Field


class Classification(BaseModel):
    label: Literal["dog", "cat", "bird", "fish", "other"] = Field(
        ..., description="What kind of animal is in the image"
    )


agent = Agent(
    model=OpenAIResponses(id="gpt-5.5"),
    instructions="You classify images by animal type.",
    output_schema=Classification,
)

url = "https://upload.wikimedia.org/wikipedia/commons/4/4d/Cat_November_2010-1a.jpg"
result = agent.run("Classify this image.", images=[Image(url=url)]).content
# Classification(label='cat')

Input argument per modality

| Modality | Import | Argument | Model in the cookbook |
|---|---|---|---|
| Image | from agno.media import Image | images=[Image(url=...)] | OpenAIResponses(id="gpt-5.5") |
| Audio | from agno.media import Audio | audio=[Audio(content=...)] | Gemini(id="gemini-3-flash-preview") |
| Video | from agno.media import Video | videos=[Video(content=..., format="mp4")] | Gemini(id="gemini-3-flash-preview") |
| PDF | from agno.media import File | files=[File(url=...)] | OpenAIResponses(id="gpt-5.5") |

Image and File accept a url. Audio and Video take raw bytes via content; fetch them first.
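import requests
from agno.media import Audio

# Audio takes raw bytes, so download the clip before calling the agent.
response = requests.get("https://example.com/clip.mp3", timeout=30)
response.raise_for_status()  # fail early on a bad download

# `agent` must be built with an audio-capable model, e.g. the Gemini model listed for Audio.
agent.run("Transcribe this.", audio=[Audio(content=response.content)])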

Bounding boxes

For region detection, return normalized coordinates so the result is resolution-independent.
from pydantic import BaseModel, Field


class BoundingBox(BaseModel):
    label: str = Field(..., description="What the box contains")
    x: float = Field(..., ge=0.0, le=1.0, description="Top-left x in [0, 1]")
    y: float = Field(..., ge=0.0, le=1.0, description="Top-left y in [0, 1]")
    width: float = Field(..., ge=0.0, le=1.0, description="Width in [0, 1]")
    height: float = Field(..., ge=0.0, le=1.0, description="Height in [0, 1]")
The per-field description on x, y, width, and height is load-bearing. Without it, and without the [0, 1] convention spelled out in the instructions, models return degenerate boxes (all-zero or whole-image). Spell out the coordinate system in both places.
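
Normalized boxes have to be scaled back to pixels before drawing or cropping. The following is a minimal sketch, assuming the box layout defined above (top-left x and y plus width and height, all in [0, 1]). The `to_pixels` helper is hypothetical and not part of Agno; it also clips boxes that spill past the image edge, since x + width can exceed 1 even when each field is valid on its own.

```python
from typing import Tuple


def to_pixels(
    x: float, y: float, width: float, height: float,
    image_width: int, image_height: int,
) -> Tuple[int, int, int, int]:
    """Convert a normalized top-left/size box to pixel (left, top, right, bottom).

    Hypothetical helper for illustration; not part of Agno.
    """
    # Each field is individually in [0, 1], but x + width can still exceed 1,
    # so clip the far edges to the image bounds.
    right = min(x + width, 1.0)
    bottom = min(y + height, 1.0)
    return (
        round(x * image_width),
        round(y * image_height),
        round(right * image_width),
        round(bottom * image_height),
    )


box = to_pixels(0.25, 0.5, 0.5, 0.25, image_width=640, image_height=480)
print(box)  # (160, 240, 480, 360)
```

Because the model output is resolution-independent, the same box can be applied to a thumbnail or to the full-size original by changing only image_width and image_height.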

Transcription and diarization

Audio extraction covers transcription, speaker diarization, and timestamped segments. Each is a schema change, not a different API.

| Output | Schema shape |
|---|---|
| Flat transcript | { text: str } |
| Speaker turns | { turns: List[{ speaker, text }] } |
| Timestamped segments | { segments: List[{ start_seconds, end_seconds, text }] } |
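
The three shapes above could be written as Pydantic models along these lines. This is a sketch, not the cookbook's code: the class names are hypothetical, and the field types (str for speaker and text, float for the timestamps) and descriptions are assumptions, since the table gives only field names.

```python
from typing import List

from pydantic import BaseModel, Field


class Transcript(BaseModel):
    # Flat transcript: one string for the whole clip.
    text: str = Field(..., description="Full transcript of the audio")


class Turn(BaseModel):
    speaker: str = Field(..., description="Speaker label, e.g. 'Speaker 1'")
    text: str = Field(..., description="What the speaker said in this turn")


class Diarization(BaseModel):
    # Speaker turns: who said what, in spoken order.
    turns: List[Turn] = Field(..., description="Speaker turns in spoken order")


class Segment(BaseModel):
    start_seconds: float = Field(..., ge=0.0, description="Segment start, in seconds")
    end_seconds: float = Field(..., ge=0.0, description="Segment end, in seconds")
    text: str = Field(..., description="Transcript of this segment")


class TimestampedTranscript(BaseModel):
    # Timestamped segments: transcript split into timed chunks.
    segments: List[Segment] = Field(..., description="Segments in chronological order")
```

Any one of these would take the place of Classification as the output_schema; the agent.run call with the audio argument stays the same.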

Model choice

Pick the model that handles the modality natively. The cookbook defaults to gemini-3-flash-preview for audio and video, and gpt-5.5 for image and PDF. Each cookbook README notes alternatives.
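
When one script handles several modalities, the defaults listed on this page can be kept in a small lookup table. A minimal sketch follows; `ModalityDefault`, `DEFAULTS`, and `default_for` are hypothetical names for illustration, not part of Agno, and the values are copied from this page rather than queried from the library.

```python
from typing import Dict, NamedTuple


class ModalityDefault(NamedTuple):
    argument: str   # keyword argument passed to agent.run
    provider: str   # model class named in the cookbook
    model_id: str   # default model id named in the cookbook


# Hypothetical lookup table, copied from the defaults listed on this page.
DEFAULTS: Dict[str, ModalityDefault] = {
    "image": ModalityDefault("images", "OpenAIResponses", "gpt-5.5"),
    "audio": ModalityDefault("audio", "Gemini", "gemini-3-flash-preview"),
    "video": ModalityDefault("videos", "Gemini", "gemini-3-flash-preview"),
    "pdf": ModalityDefault("files", "OpenAIResponses", "gpt-5.5"),
}


def default_for(modality: str) -> ModalityDefault:
    """Return the listed default for a modality, case-insensitively."""
    try:
        return DEFAULTS[modality.lower()]
    except KeyError:
        raise ValueError(f"No default listed for modality: {modality!r}") from None


print(default_for("Audio").model_id)  # gemini-3-flash-preview
```

Keeping the argument name next to the model id makes it harder to pair a modality with a model that does not accept it.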

Next steps

| Task | Guide |
|---|---|
| Define the output schema | Structured extraction |
| Assign labels to media | Classification |
| Review media labels | Quality pipeline |
