Multimodal Agents
Agno agents accept text, image, audio, and video inputs and can generate text, image, audio, and video outputs. For a complete overview, check out the compatibility matrix.
Multimodal inputs to an agent
Let’s create an agent that can understand images and make tool calls as needed.
Image Agent
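The following is a minimal sketch of such an image agent, not a definitive implementation. It assumes an Agno 1.x-style API (Agent, agno.media.Image, OpenAIChat, DuckDuckGoTools, and the images argument of print_response); names may differ in your installed version. The helper image_source_kwargs and the wrapper describe_image are hypothetical names introduced here for illustration, and the model id is only an example.

```python
from urllib.parse import urlparse


def image_source_kwargs(source: str) -> dict:
    """Pick the keyword for Image: url= for http(s) links, filepath= otherwise."""
    scheme = urlparse(source).scheme
    if scheme in ("http", "https"):
        return {"url": source}
    return {"filepath": source}


def describe_image(source: str, question: str) -> None:
    """Ask a tool-enabled agent about one image and stream the answer."""
    # Agno imports are deferred so the helper above works without Agno installed.
    from agno.agent import Agent
    from agno.media import Image
    from agno.models.openai import OpenAIChat
    from agno.tools.duckduckgo import DuckDuckGoTools

    agent = Agent(
        model=OpenAIChat(id="gpt-4o"),  # example model id (assumption)
        tools=[DuckDuckGoTools()],  # lets the agent search the web when needed
        markdown=True,
    )
    agent.print_response(
        question,
        images=[Image(**image_source_kwargs(source))],
        stream=True,
    )
```

To try it, add a call such as describe_image("https://example.com/photo.jpg", "Tell me about this image and give me the latest news about it.") at the end of the script. The URL is a placeholder, and the call needs the agno package, an OpenAI API key, and network access.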
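Run the agent by executing the script with Python, for example python image_agent.py if it is saved under that file name (the name is only an illustration).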
As with images, you can also pass audio and video as input.
Audio Agent
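An audio agent could look roughly like the sketch below. It assumes an Agno 1.x-style API (agno.media.Audio with content and format, and the audio argument of print_response) and an audio-capable OpenAI model; the model id is an example. To keep the sketch self-contained, the hypothetical helper make_silent_wav builds a short silent WAV in memory as a stand-in for a real recording, and describe_audio is likewise an illustrative name.

```python
import io
import wave


def make_silent_wav(seconds: float = 1.0, sample_rate: int = 16000) -> bytes:
    """Build a mono 16-bit PCM WAV of silence in memory (stand-in for real audio)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(b"\x00\x00" * int(seconds * sample_rate))
    return buffer.getvalue()


def describe_audio(wav_bytes: bytes, question: str = "What is in this audio?") -> None:
    """Send WAV bytes to an agent and print its text answer."""
    # Deferred imports: assumed Agno 1.x module layout.
    from agno.agent import Agent
    from agno.media import Audio
    from agno.models.openai import OpenAIChat

    agent = Agent(
        # Audio in, text out: only the "text" modality is requested here.
        model=OpenAIChat(id="gpt-4o-audio-preview", modalities=["text"]),
        markdown=True,
    )
    agent.print_response(question, audio=[Audio(content=wav_bytes, format="wav")])
```

In practice you would read the bytes from a real recording, for example with Path("question.wav").read_bytes(), rather than generating silence.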
Video Agent
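Video input can be sketched the same way. This example assumes an Agno 1.x-style API (agno.media.Video with filepath, and the videos argument of print_response) and a Gemini model that accepts video; the model id is an example and may be outdated. The hypothetical helper resolve_video checks that the file exists before any model call is made, and describe_video is an illustrative wrapper.

```python
from pathlib import Path


def resolve_video(path: str) -> Path:
    """Return an absolute path to the video, failing early if the file is missing."""
    video_path = Path(path).expanduser().resolve()
    if not video_path.is_file():
        raise FileNotFoundError(f"Video not found: {video_path}")
    return video_path


def describe_video(path: str, question: str = "Tell me about this video") -> None:
    """Send a local video file to an agent and print its answer."""
    # Deferred imports: assumed Agno 1.x module layout.
    from agno.agent import Agent
    from agno.media import Video
    from agno.models.google import Gemini

    agent = Agent(
        model=Gemini(id="gemini-2.0-flash-exp"),  # example model id (assumption)
        markdown=True,
    )
    agent.print_response(question, videos=[Video(filepath=resolve_video(path))])
```

Checking the path first is a design choice of this sketch: a missing file then raises FileNotFoundError locally instead of surfacing as a less obvious error from the model call.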
Multimodal outputs from an agent
Similar to providing multimodal inputs, you can also get multimodal outputs from an agent.
Image Generation
The following example demonstrates how to generate an image using DALL-E with an agent.
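A hedged sketch of such an image-generating agent follows. It assumes an Agno 1.x-style API in which DalleTools lives in agno.tools.dalle and generated images can be read back with agent.get_images(); newer releases may expose images on the run result instead. The helper collect_image_urls and the wrapper generate_image are hypothetical names introduced for illustration.

```python
def collect_image_urls(images) -> list:
    """Return the non-empty .url values from a list of image objects ([] for None)."""
    return [img.url for img in (images or []) if getattr(img, "url", None)]


def generate_image(prompt: str) -> list:
    """Ask an agent to generate an image and return the resulting URLs."""
    # Deferred imports: assumed Agno 1.x module layout.
    from agno.agent import Agent
    from agno.models.openai import OpenAIChat
    from agno.tools.dalle import DalleTools

    agent = Agent(
        model=OpenAIChat(id="gpt-4o"),  # example model id (assumption)
        tools=[DalleTools()],
        description="You are an AI agent that can generate images using DALL-E.",
        instructions="When the user asks for an image, use the image creation tool.",
        markdown=True,
    )
    agent.print_response(prompt)
    # get_images() is assumed from Agno 1.x; check your version's API.
    return collect_image_urls(agent.get_images())
```

A call such as generate_image("Generate an image of a white siamese cat") would then return a list of URLs, or an empty list if the agent produced no image.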
Audio Response
The following example demonstrates how to obtain both text and audio responses from an agent. The agent will respond with text and audio bytes that can be saved to a file.
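One way to sketch this is shown below. It assumes an Agno 1.x-style API in which OpenAIChat accepts modalities and audio settings and the run result carries the audio on a response_audio attribute; attribute names may differ in other versions. The helper save_audio and the wrapper tell_story are hypothetical. Because the audio content may arrive either as raw bytes or as base64 text depending on the version, the helper accepts both, which is an assumption of this sketch rather than documented behavior.

```python
import base64
from pathlib import Path


def save_audio(content, filename: str) -> Path:
    """Write audio to disk; a str is treated as base64 text, bytes are written as-is."""
    data = base64.b64decode(content) if isinstance(content, str) else content
    path = Path(filename)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(data)
    return path


def tell_story(prompt: str, filename: str = "tmp/story.wav"):
    """Run the agent, print its text reply, and save the audio reply if present."""
    # Deferred imports: assumed Agno 1.x module layout.
    from agno.agent import Agent
    from agno.models.openai import OpenAIChat

    agent = Agent(
        model=OpenAIChat(
            id="gpt-4o-audio-preview",  # example audio-capable model id
            modalities=["text", "audio"],
            audio={"voice": "alloy", "format": "wav"},
        ),
        markdown=True,
    )
    response = agent.run(prompt)
    print(response.content)
    # response_audio is the attribute name assumed from Agno 1.x.
    audio = getattr(response, "response_audio", None)
    return save_audio(audio.content, filename) if audio is not None else None
```

For example, tell_story("Tell me a 5 second scary story") would print the text and return the path of the saved WAV file, or None when no audio came back.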
Multimodal inputs and outputs together
You can create Agents that can take multimodal inputs and return multimodal outputs. The following example demonstrates how to provide a combination of audio and text inputs to an agent and obtain both text and audio outputs.
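The combined case can be sketched as follows, again assuming an Agno 1.x-style API (the audio argument of agent.run, agno.media.Audio, and a response_audio attribute on the result). The helper load_audio and the wrapper ask_with_audio are hypothetical names; inferring the format from the file extension is a simplification chosen for this sketch.

```python
from pathlib import Path


def load_audio(path: str):
    """Read an audio file and infer its format from the file extension."""
    audio_path = Path(path)
    return audio_path.read_bytes(), audio_path.suffix.lstrip(".").lower()


def ask_with_audio(path: str, question: str):
    """Send text plus an audio file; return (text reply, audio reply content or None)."""
    # Deferred imports: assumed Agno 1.x module layout.
    from agno.agent import Agent
    from agno.media import Audio
    from agno.models.openai import OpenAIChat

    content, audio_format = load_audio(path)
    agent = Agent(
        model=OpenAIChat(
            id="gpt-4o-audio-preview",  # example audio-capable model id
            modalities=["text", "audio"],  # request both output types
            audio={"voice": "alloy", "format": "wav"},
        ),
        markdown=True,
    )
    response = agent.run(question, audio=[Audio(content=content, format=audio_format)])
    # response_audio is the attribute name assumed from Agno 1.x.
    reply_audio = getattr(response, "response_audio", None)
    return response.content, (reply_audio.content if reply_audio is not None else None)
```

The returned audio content can then be written to a file in whatever encoding your Agno version provides it.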