Agno agents support text, image, audio, and video inputs and can generate text, image, audio, and video outputs. For a complete overview, see the compatibility matrix.
from agno.agent import Agent
from agno.media import Image
from agno.models.openai import OpenAIChat
from agno.tools.duckduckgo import DuckDuckGoTools

agent = Agent(
    model=OpenAIChat(id="gpt-5-mini"),
    tools=[DuckDuckGoTools()],
    markdown=True,
)

agent.print_response(
    "Tell me about this image and give me the latest news about it.",
    images=[
        Image(
            url="https://upload.wikimedia.org/wikipedia/commons/0/0c/GoldenGateBridge-001.jpg"
        )
    ],
    stream=True,
)
Run the agent:
python image_agent.py
As with images, you can also pass audio and video as input.
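As a rough sketch of audio input, the snippet below builds a short WAV clip in memory with the standard library and hands it to an agent. It assumes that `agno.media.Audio` accepts raw bytes through `content` together with a `format`, that `agent.run` takes an `audio` list mirroring the `images` list shown earlier, and that the model id used here accepts audio input; verify all three against the compatibility matrix. `make_tone_wav` and `ask_about_audio` are hypothetical helper names, not part of Agno.

```python
import io
import math
import struct
import wave


def make_tone_wav(seconds: float = 0.5, freq_hz: float = 440.0, rate: int = 16000) -> bytes:
    """Return a mono 16-bit PCM sine tone encoded as WAV bytes."""
    n_frames = int(seconds * rate)
    frames = b"".join(
        struct.pack("<h", int(0.3 * 32767 * math.sin(2 * math.pi * freq_hz * i / rate)))
        for i in range(n_frames)
    )
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 16-bit samples
        wav.setframerate(rate)
        wav.writeframes(frames)
    return buf.getvalue()


def ask_about_audio(wav_bytes: bytes, prompt: str = "What is in this audio?"):
    """Send WAV bytes to an agent. The Agno calls here are assumptions (see the text above)."""
    # Imported lazily so make_tone_wav works without agno installed.
    from agno.agent import Agent
    from agno.media import Audio
    from agno.models.openai import OpenAIChat

    agent = Agent(
        # Assumption: an audio-capable model id; substitute one you have access to.
        model=OpenAIChat(id="gpt-4o-audio-preview", modalities=["text"]),
        markdown=True,
    )
    return agent.run(prompt, audio=[Audio(content=wav_bytes, format="wav")])
```

Calling `ask_about_audio(make_tone_wav())` requires a configured API key. Video input would presumably follow the same pattern, with a list of video objects passed to a model that supports video.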
Just as agents accept multimodal inputs, they can also produce multimodal outputs. You can either use tools to generate images, audio, or video, or use the agent’s model to generate them directly (if the model supports this capability).
The following example demonstrates how to generate an image using an OpenAI tool with an agent.
image_agent.py
from agno.agent import Agent
from agno.models.openai import OpenAIChat
from agno.tools.openai import OpenAITools
from agno.utils.media import save_base64_data

agent = Agent(
    model=OpenAIChat(id="gpt-5-mini"),
    tools=[OpenAITools(image_model="gpt-image-1")],
    markdown=True,
)

response = agent.run(
    "Generate a photorealistic image of a cozy coffee shop interior",
)

if response.images and response.images[0].content:
    save_base64_data(str(response.images[0].content), "tmp/coffee_shop.png")
The output of a media-generating tool is also passed to the model as an input
message, so the model has access to the media (image, audio, video) and can use
it in its response. For example, if you say “Generate an image of a dog and tell
me its color.”, the model can see the generated image and describe the dog’s
color in the same run.
The following example demonstrates how some models can directly generate images as part of their response.
image_agent.py
from io import BytesIO

from agno.agent import Agent, RunOutput  # noqa
from agno.models.google import Gemini
from PIL import Image

# No system message should be provided
agent = Agent(
    model=Gemini(
        id="gemini-2.0-flash-exp-image-generation",
        response_modalities=["Text", "Image"],  # Generate both text and images
    )
)

# Run the agent and display any generated images
run_response = agent.run("Make me an image of a cat in a tree.")
if run_response and isinstance(run_response, RunOutput) and run_response.images:
    for image_response in run_response.images:
        image_bytes = image_response.content
        if image_bytes:
            image = Image.open(BytesIO(image_bytes))
            image.show()
            # Save the image to a file
            # image.save("generated_image.png")
else:
    print("No images found in run response")
You can find all generated images in the RunOutput.images list.
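Since every generated image ends up in that list, a small helper can write them all to disk in one pass. The sketch below is a standard-library illustration under the assumption that each item exposes its raw bytes as `.content`, as in the Gemini example above; `save_images` is a hypothetical helper, and the `SimpleNamespace` objects stand in for real image results so the snippet runs without calling a model.

```python
import tempfile
from pathlib import Path
from types import SimpleNamespace
from typing import Iterable, List


def save_images(images: Iterable, out_dir: str, stem: str = "image", ext: str = "png") -> List[Path]:
    """Write each image's raw bytes to out_dir and return the written paths."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for index, image in enumerate(images or []):
        data = getattr(image, "content", None)
        if not isinstance(data, (bytes, bytearray)):
            # Entries without raw bytes (for example URL-only results) are skipped.
            continue
        path = target / f"{stem}_{index}.{ext}"
        path.write_bytes(data)
        written.append(path)
    return written


# Stand-in objects (assumption: real results expose bytes via `.content`).
fake_images = [SimpleNamespace(content=b"\x89PNG fake"), SimpleNamespace(content=None)]

with tempfile.TemporaryDirectory() as tmp:
    paths = save_images(fake_images, tmp)
    print([p.name for p in paths])  # prints ['image_0.png']
```

With a real run, the call would look like `save_images(run_response.images, "tmp/generated")`.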
The following example demonstrates how to generate audio using the ElevenLabs tool with an agent. See Eleven Labs for more details.
audio_agent.py
import base64

from agno.agent import Agent
from agno.models.google import Gemini
from agno.tools.eleven_labs import ElevenLabsTools
from agno.utils.media import save_base64_data

audio_agent = Agent(
    model=Gemini(id="gemini-2.5-pro"),
    tools=[
        ElevenLabsTools(
            voice_id="21m00Tcm4TlvDq8ikWAM",
            model_id="eleven_multilingual_v2",
            target_directory="audio_generations",
        )
    ],
    description="You are an AI agent that can generate audio using the ElevenLabs API.",
    instructions=[
        "When the user asks you to generate audio, use the `generate_audio` tool to generate the audio.",
        "You'll generate the appropriate prompt to send to the tool to generate audio.",
        "You don't need to find the appropriate voice first, I already specified the voice to use.",
        "Return the audio file name in your response. Don't convert it to markdown.",
        "The audio should be long and detailed.",
    ],
    markdown=True,
)

response = audio_agent.run(
    "Generate a very long audio of history of french revolution and tell me which subject it belongs to.",
    debug_mode=True,
)

# Save the first generated audio clip, if any
if response.audio:
    print("Agent response:", response.content)
    base64_audio = base64.b64encode(response.audio[0].content).decode("utf-8")
    save_base64_data(base64_audio, "tmp/french_revolution.mp3")
    print("Successfully saved generated speech to tmp/french_revolution.mp3")

audio_agent.print_response("Generate a kick sound effect")
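The example above encodes the audio bytes as base64 before saving them. The following standard-library sketch shows that round trip in isolation. It assumes `save_base64_data` decodes the string and writes the bytes to the given path; `write_base64_file` is a hypothetical stand-in for illustration, not Agno's implementation.

```python
import base64
import tempfile
from pathlib import Path


def write_base64_file(base64_data: str, path: str) -> Path:
    """Decode a base64 string and write the resulting bytes to path."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)  # create parent folders if missing
    target.write_bytes(base64.b64decode(base64_data))
    return target


# Placeholder bytes standing in for response.audio[0].content.
raw_audio = b"ID3 placeholder bytes"
encoded = base64.b64encode(raw_audio).decode("utf-8")

with tempfile.TemporaryDirectory() as tmp:
    saved = write_base64_file(encoded, f"{tmp}/audio/clip.mp3")
    assert saved.read_bytes() == raw_audio  # the round trip preserves the bytes
```

When the content is already raw bytes, writing it directly with `Path.write_bytes` produces the same file without the intermediate encoding step.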
The following example demonstrates how some models can directly generate audio as part of their response.
audio_agent.py
from agno.agent import Agent, RunOutput
from agno.models.openai import OpenAIChat
from agno.utils.audio import write_audio_to_file

agent = Agent(
    # The model must support audio output
    model=OpenAIChat(
        id="gpt-5-mini-audio-preview",
        modalities=["text", "audio"],
        audio={"voice": "alloy", "format": "wav"},
    ),
    markdown=True,
)

response: RunOutput = agent.run("Tell me a 5 second scary story")

# Save the response audio to a file
if response.response_audio is not None:
    write_audio_to_file(
        audio=response.response_audio.content, filename="tmp/scary_story.wav"
    )
The following example demonstrates how to generate a video using FalTools with an agent. See FAL for more details.
video_agent.py
from agno.agent import Agent
from agno.models.openai import OpenAIChat
from agno.tools.fal import FalTools

fal_agent = Agent(
    name="Fal Video Generator Agent",
    model=OpenAIChat(id="gpt-5-mini"),
    tools=[
        FalTools(
            model="fal-ai/hunyuan-video",
            enable_generate_media=True,
        )
    ],
    description="You are an AI agent that can generate videos using the Fal API.",
    instructions=[
        "When the user asks you to create a video, use the `generate_media` tool to create the video.",
        "Return the URL as raw to the user.",
        "Don't convert video URL to markdown or anything else.",
    ],
    markdown=True,
)

fal_agent.print_response("Generate video of balloon in the ocean")
You can create agents that take multimodal inputs and return multimodal outputs. The following example demonstrates how to provide a combination of audio and text inputs to an agent and obtain both text and audio outputs.