Build an Agent that transcribes audio files into structured data with speaker identification and conversation metadata. The Agent uses multimodal model capabilities to convert audio content into a typed transcription.

What You’ll Learn

By building this agent, you’ll understand:
  • How to use Pydantic schemas for structured transcription output
  • How to configure a multimodal Agent for audio processing
  • How to identify speakers in audio conversations

Use Cases

Transcribe meeting recordings with speaker identification, convert podcast episodes into searchable text, create subtitles for video content, or build voice note analyzers with structured transcription data.

How It Works

The Agent uses multimodal capabilities to process audio directly and output structured transcription data:
  1. Input: Accepts WAV audio files from URLs (or local files)
  2. Process: Multimodal model analyzes audio content and identifies speakers
  3. Structure: Output is validated against a Pydantic schema
  4. Output: Returns typed data with transcript, description, and speaker list
The structured output makes transcriptions immediately usable in downstream applications without additional parsing.

STT example using Gemini 3 Flash Preview

speech_to_text_agent.py
import httpx
from agno.agent import Agent, RunOutput  # noqa
from agno.media import Audio
from agno.models.google import Gemini
from pydantic import BaseModel, Field

INSTRUCTIONS = """
Transcribe the audio accurately and completely.

Speaker identification:
- Use the speaker's name if mentioned in the conversation
- Otherwise use 'Speaker 1', 'Speaker 2', etc. consistently

Non-speech audio:
- Note significant non-speech elements (e.g., [long pause], [music], [background noise]) only when relevant to understanding the conversation
- Ignore brief natural pauses

Include everything spoken, even false starts and filler words (um, uh, etc.).
"""


class Utterance(BaseModel):
    speaker: str = Field(..., description="Name or identifier of the speaker")
    text: str = Field(..., description="What was said by the speaker")


class Transcription(BaseModel):
    description: str = Field(..., description="A description of the audio conversation")
    utterances: list[Utterance] = Field(
        ..., description="Sequential list of utterances in conversation order"
    )


# Fetch the audio file and load it as raw bytes
# Simple audio file with a single speaker
# url = "https://openaiassets.blob.core.windows.net/$web/API/docs/audio/alloy.wav"
# Audio file with multiple speakers
url = "https://agno-public.s3.us-east-1.amazonaws.com/demo_data/sample_audio.wav"

try:
    response = httpx.get(url)
    response.raise_for_status()
    wav_data = response.content
except httpx.HTTPStatusError as e:
    raise ValueError(f"Error fetching audio file: {url}") from e

# Provide the agent with the audio file and get the result as structured data
agent = Agent(
    model=Gemini(id="gemini-3-flash-preview"),
    markdown=True,
    instructions=INSTRUCTIONS,
    output_schema=Transcription,
)

agent.print_response(
    "Give a transcript of the audio conversation",
    audio=[Audio(content=wav_data)],
)
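
The example above streams the sample file from a URL, but local audio works the same way: read the bytes yourself and pass them as content. A minimal sketch, assuming a WAV file at ./sample_audio.wav (hypothetical path) and reusing the agent defined above:

from pathlib import Path

from agno.media import Audio

# Read a local WAV file into memory and hand the raw bytes to the agent,
# exactly as we did with the downloaded file above
local_wav = Path("sample_audio.wav").read_bytes()

agent.print_response(
    "Give a transcript of the audio conversation",
    audio=[Audio(content=local_wav, format="wav")],
)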

STT example using OpenAI gpt-audio

speech_to_text_agent.py
import httpx
from agno.agent import Agent, RunOutput  # noqa
from agno.media import Audio
from agno.models.openai import OpenAIChat
from pydantic import BaseModel, Field

INSTRUCTIONS = """
Transcribe the audio accurately and completely.

Speaker identification:
- Use the speaker's name if mentioned in the conversation
- Otherwise use 'Speaker 1', 'Speaker 2', etc. consistently

Non-speech audio:
- Note significant non-speech elements (e.g., [long pause], [music], [background noise]) only when relevant to understanding the conversation
- Ignore brief natural pauses

Include everything spoken, even false starts and filler words (um, uh, etc.).
"""


class Utterance(BaseModel):
    speaker: str = Field(..., description="Name or identifier of the speaker")
    text: str = Field(..., description="What was said by the speaker")


class Transcription(BaseModel):
    description: str = Field(..., description="A description of the audio conversation")
    utterances: list[Utterance] = Field(
        ..., description="Sequential list of utterances in conversation order"
    )


# Fetch the audio file and load it as raw bytes
# Simple audio file with a single speaker
# url = "https://openaiassets.blob.core.windows.net/$web/API/docs/audio/alloy.wav"
# Audio file with multiple speakers
url = "https://agno-public.s3.us-east-1.amazonaws.com/demo_data/sample_audio.wav"

try:
    response = httpx.get(url)
    response.raise_for_status()
    wav_data = response.content
except httpx.HTTPStatusError as e:
    raise ValueError(f"Error fetching audio file: {url}") from e

# Provide the agent with the audio file and get the result as structured data
agent = Agent(
    model=OpenAIChat(id="gpt-audio-2025-08-28", modalities=["text"]),
    markdown=True,
    instructions=INSTRUCTIONS,
    output_schema=Transcription,
    # We use a parser model here as gpt-audio-2025-08-28 cannot return structured output by itself
    parser_model=OpenAIChat(id="gpt-5-mini"),
)

agent.print_response(
    "Give a transcript of the audio conversation",
    audio=[Audio(content=wav_data, format="wav")],
)

What to Expect

The agent processes audio files and returns a structured Transcription object containing:
  • description: A summary describing what the audio is about
  • utterances: Sequential list of utterances in conversation order, with speakers identified by name when mentioned, otherwise as “Speaker 1”, “Speaker 2”, etc.
Each utterance contains:
  • speaker: Name or identifier of the speaker
  • text: What was said by the speaker
Processing time depends on audio length, typically 10-30 seconds for files under 5 minutes.
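
print_response renders the result in the terminal; to work with the transcription programmatically, call agent.run() and read the typed content. A minimal sketch, assuming the agent and wav_data from either example above, and that RunOutput.content holds the parsed Transcription when an output_schema is set:

# Run the agent and capture the structured result instead of printing it
run_output: RunOutput = agent.run(
    "Give a transcript of the audio conversation",
    audio=[Audio(content=wav_data, format="wav")],
)

transcription = run_output.content  # a Transcription instance
print(transcription.description)
for utterance in transcription.utterances:
    print(f"{utterance.speaker}: {utterance.text}")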

Usage

1. Create a virtual environment

Open the Terminal and create a Python virtual environment.

python3 -m venv .venv
source .venv/bin/activate

2. Set your API key

export GOOGLE_API_KEY=xxx

For the OpenAI example, set OPENAI_API_KEY instead.

3. Install libraries

pip install -U agno google-genai httpx

For the OpenAI example, install openai instead of google-genai.

4. Run the Agent

python speech_to_text_agent.py

Next Steps

  • Remove the output_schema and use plain text output if your use case does not require structured data
  • Extend the Transcription schema with additional fields like sentiment or topics (see the sketch after this list)
  • Try processing different audio formats (MP3, WAV, M4A)
  • Combine with other tools for enhanced analysis
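
As an example of extending the schema, you can subclass the Transcription model from the example scripts and add fields for the model to fill in. A minimal sketch; sentiment and topics are illustrative additions, not part of the example above:

from pydantic import Field


class EnrichedTranscription(Transcription):
    # Hypothetical extra fields layered on top of the Transcription schema above
    sentiment: str = Field(..., description="Overall sentiment of the conversation (positive, neutral, negative)")
    topics: list[str] = Field(..., description="Main topics discussed in the conversation")


# Pass the extended schema to the agent instead of Transcription:
# agent = Agent(..., output_schema=EnrichedTranscription)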