Build an Agent that transcribes audio files into structured data with speaker identification and conversation metadata. The Agent uses multimodal model capabilities to convert audio content into a typed transcription.

What You’ll Learn

By building this agent, you’ll understand:
  • How to use Pydantic schemas for structured transcription output
  • How to configure a multimodal Agent for audio processing
  • How to identify speakers in audio conversations

Use Cases

Transcribe meeting recordings with speaker identification, convert podcast episodes into searchable text, create subtitles for video content, or build voice note analyzers with structured transcription data.

How It Works

The Agent uses multimodal capabilities to process audio directly and output structured transcription data:
  1. Input: Accepts WAV audio files from URLs (or local files)
  2. Process: Multimodal model analyzes audio content and identifies speakers
  3. Structure: Output is validated against a Pydantic schema
  4. Output: Returns typed data with transcript, description, and speaker list
The structured output makes transcriptions immediately usable in downstream applications without additional parsing.

STT example using Gemini 3 Flash Preview

speech_to_text_agent.py
import httpx
from agno.agent import Agent, RunOutput  # noqa
from agno.media import Audio
from agno.models.google import Gemini
from pydantic import BaseModel, Field

INSTRUCTIONS = """
Transcribe the audio accurately and completely.

Speaker identification:
- Use the speaker's name if mentioned in the conversation
- Otherwise use 'Speaker 1', 'Speaker 2', etc. consistently

Non-speech audio:
- Note significant non-speech elements (e.g., [long pause], [music], [background noise]) only when relevant to understanding the conversation
- Ignore brief natural pauses

Include everything spoken, even false starts and filler words (um, uh, etc.).
"""


class Utterance(BaseModel):
    speaker: str = Field(..., description="Name or identifier of the speaker")
    text: str = Field(..., description="What was said by the speaker")


class Transcription(BaseModel):
    description: str = Field(..., description="A description of the audio conversation")
    utterances: list[Utterance] = Field(
        ..., description="Sequential list of utterances in conversation order"
    )


# Fetch the audio file and load it as raw bytes
# Simple audio file with a single speaker
# url = "https://openaiassets.blob.core.windows.net/$web/API/docs/audio/alloy.wav"
# Audio file with multiple speakers
url = "https://agno-public.s3.us-east-1.amazonaws.com/demo_data/sample_audio.wav"

try:
    response = httpx.get(url)
    response.raise_for_status()
    wav_data = response.content
except httpx.HTTPStatusError as e:
    raise ValueError(f"Error fetching audio file: {url}") from e

# Provide the agent with the audio file and get the result as structured data
agent = Agent(
    model=Gemini(id="gemini-3-flash-preview"),
    markdown=True,
    instructions=INSTRUCTIONS,
    output_schema=Transcription,
)

agent.print_response(
    "Give a transcript of the audio conversation",
    audio=[Audio(content=wav_data)],
)
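
The example above streams the sample file from a URL, but local audio works the same way: read the bytes yourself and pass them as content. A minimal sketch, assuming a WAV file at ./sample_audio.wav (hypothetical path) and reusing the agent defined above:

from pathlib import Path

from agno.media import Audio

# Read a local WAV file into memory and hand the raw bytes to the agent,
# exactly as we did with the downloaded file above
local_wav = Path("sample_audio.wav").read_bytes()

agent.print_response(
    "Give a transcript of the audio conversation",
    audio=[Audio(content=local_wav, format="wav")],
)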

STT example using OpenAI gpt-audio

speech_to_text_agent.py
import httpx
from agno.agent import Agent, RunOutput  # noqa
from agno.media import Audio
from agno.models.openai import OpenAIChat
from pydantic import BaseModel, Field

INSTRUCTIONS = """
Transcribe the audio accurately and completely.

Speaker identification:
- Use the speaker's name if mentioned in the conversation
- Otherwise use 'Speaker 1', 'Speaker 2', etc. consistently

Non-speech audio:
- Note significant non-speech elements (e.g., [long pause], [music], [background noise]) only when relevant to understanding the conversation
- Ignore brief natural pauses

Include everything spoken, even false starts and filler words (um, uh, etc.).
"""


class Utterance(BaseModel):
    speaker: str = Field(..., description="Name or identifier of the speaker")
    text: str = Field(..., description="What was said by the speaker")


class Transcription(BaseModel):
    description: str = Field(..., description="A description of the audio conversation")
    utterances: list[Utterance] = Field(
        ..., description="Sequential list of utterances in conversation order"
    )


# Fetch the audio file and load it as raw bytes
# Simple audio file with a single speaker
# url = "https://openaiassets.blob.core.windows.net/$web/API/docs/audio/alloy.wav"
# Audio file with multiple speakers
url = "https://agno-public.s3.us-east-1.amazonaws.com/demo_data/sample_audio.wav"

try:
    response = httpx.get(url)
    response.raise_for_status()
    wav_data = response.content
except httpx.HTTPStatusError as e:
    raise ValueError(f"Error fetching audio file: {url}") from e

# Provide the agent with the audio file and get the result as structured data
agent = Agent(
    model=OpenAIChat(id="gpt-audio-2025-08-28", modalities=["text"]),
    markdown=True,
    instructions=INSTRUCTIONS,
    output_schema=Transcription,
    # We use a parser model here as gpt-audio-2025-08-28 cannot return structured output by itself
    parser_model=OpenAIChat(id="gpt-5-mini"),
)

agent.print_response(
    "Give a transcript of the audio conversation",
    audio=[Audio(content=wav_data, format="wav")],
)

What to Expect

The agent processes audio files and returns a structured Transcription object containing:
  • description: A summary describing what the audio is about
  • utterances: Sequential list of utterances in conversation order, with speakers identified by name when mentioned, otherwise as “Speaker 1”, “Speaker 2”, etc.
Each utterance contains:
  • speaker: Name or identifier of the speaker
  • text: What was said by the speaker
Processing time depends on audio length, typically 10-30 seconds for files under 5 minutes.
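
print_response renders the result in the terminal; to work with the transcription programmatically, call agent.run() and read the typed content. A minimal sketch, assuming the agent and wav_data from either example above, and that RunOutput.content holds the parsed Transcription when an output_schema is set:

# Run the agent and capture the structured result instead of printing it
run_output: RunOutput = agent.run(
    "Give a transcript of the audio conversation",
    audio=[Audio(content=wav_data, format="wav")],
)

transcription = run_output.content  # a Transcription instance
print(transcription.description)
for utterance in transcription.utterances:
    print(f"{utterance.speaker}: {utterance.text}")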

Usage

1. Create a virtual environment

Open the Terminal and create a Python virtual environment.

python3 -m venv .venv
source .venv/bin/activate

2. Set your API key

export GOOGLE_API_KEY=xxx

For the OpenAI example, set OPENAI_API_KEY instead.

3. Install libraries

pip install -U agno google-genai httpx

For the OpenAI example, install openai instead of google-genai.

4. Run the Agent

python speech_to_text_agent.py

Next Steps

  • Remove the output_schema and use plain text output if your use case does not require structured data
  • Extend the Transcription schema with additional fields like sentiment or topics (see the sketch after this list)
  • Try processing different audio formats (MP3, WAV, M4A)
  • Combine with other tools for enhanced analysis
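
As an example of extending the schema, you can subclass the Transcription model from the example scripts and add fields for the model to fill in. A minimal sketch; sentiment and topics are illustrative additions, not part of the example above:

from pydantic import Field


class EnrichedTranscription(Transcription):
    # Hypothetical extra fields layered on top of the Transcription schema above
    sentiment: str = Field(..., description="Overall sentiment of the conversation (positive, neutral, negative)")
    topics: list[str] = Field(..., description="Main topics discussed in the conversation")


# Pass the extended schema to the agent instead of Transcription:
# agent = Agent(..., output_schema=EnrichedTranscription)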