What You’ll Learn
By building this agent, you'll understand:

- How to use Pydantic schemas for structured transcription output
- How to configure a multimodal Agent for audio processing
- How to identify speakers in audio conversations
Use Cases
Transcribe meeting recordings with speaker identification, convert podcast episodes into searchable text, create subtitles for video content, or build voice note analyzers with structured transcription data.

How It Works
The Agent uses multimodal capabilities to process audio directly and output structured transcription data:

- Input: Accepts WAV audio files from URLs (or local files)
- Process: Multimodal model analyzes audio content and identifies speakers
- Structure: Output is validated against a Pydantic schema
- Output: Returns typed data with transcript, description, and speaker list
STT example using Gemini Flash 3 Preview
speech_to_text_agent.py
STT example using OpenAI gpt-audio
speech_to_text_agent.py
What to Expect
The agent processes audio files and returns a structured Transcription object containing:
- description: A summary describing what the audio is about
- utterances: List of utterances, each containing:
  - speaker: Name or identifier of the speaker (names if mentioned, otherwise "Speaker 1", "Speaker 2", etc.)
  - text: What was said by the speaker
Usage
1. Create a virtual environment

   Open the Terminal and create a Python virtual environment.

2. Set your API key

   ```bash
   export GOOGLE_API_KEY=xxx
   ```

3. Install libraries

   ```bash
   pip install -U agno google-genai httpx
   ```

4. Run the Agent

   ```bash
   python speech_to_text_agent.py
   ```
Next Steps
- Remove the structured output and use the text output instead if your use case does not require structured outputs
- Extend the Transcription schema with additional fields like sentiment or topics
- Try processing different audio formats (MP3, WAV, M4A)
- Combine with other tools for enhanced analysis
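As an example of the schema extension mentioned above, here is one way the Transcription model could grow hypothetical `sentiment` and `topics` fields; the field names and value conventions are assumptions, not part of the original example.

```python
# Sketch: extending the Transcription schema with extra analysis fields.
from typing import List, Optional

from pydantic import BaseModel, Field


class Utterance(BaseModel):
    speaker: str
    text: str


class Transcription(BaseModel):
    description: str
    utterances: List[Utterance]
    # Hypothetical extensions — the model fills these in alongside the transcript:
    sentiment: Optional[str] = Field(
        None, description="Overall tone of the conversation, e.g. positive/neutral/negative"
    )
    topics: List[str] = Field(
        default_factory=list, description="Main topics discussed in the audio"
    )
```

Because the schema is plain Pydantic, adding fields requires no agent changes beyond the model populating them.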