Similar to providing multimodal inputs, you can also get multimodal outputs from an agent.

## Image Generation using a tool

The following example demonstrates how to generate an image using an OpenAI tool with an agent.
```python image_agent.py
from agno.agent import Agent
from agno.db.sqlite import SqliteDb
from agno.models.openai import OpenAIChat
from agno.tools.openai import OpenAITools
from agno.utils.media import save_base64_data

agent = Agent(
    model=OpenAIChat(id="gpt-5-mini"),
    db=SqliteDb(db_file="tmp/test.db"),
    tools=[OpenAITools(image_model="gpt-image-1")],
    add_history_to_context=True,
    markdown=True,
)

response = agent.run(
    "Generate a photorealistic image of a cozy coffee shop interior",
)

if response.images and response.images[0].content:
    save_base64_data(str(response.images[0].content), "tmp/coffee_shop.png")
```
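If you prefer not to use the `save_base64_data` helper, the same result can be achieved with the standard library. The sketch below is an assumption about what such a helper roughly does (the actual implementation, e.g. its handling of data-URI prefixes, may differ):

```python
import base64
from pathlib import Path


def save_base64_image(b64_data: str, path: str) -> None:
    """Decode a base64 payload and write the raw bytes to `path`."""
    # Strip a data-URI prefix such as "data:image/png;base64," if present.
    if b64_data.strip().startswith("data:") and "," in b64_data:
        b64_data = b64_data.split(",", 1)[1]
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)  # ensure e.g. tmp/ exists
    out.write_bytes(base64.b64decode(b64_data))
```

Usage mirrors the example above: `save_base64_image(str(response.images[0].content), "tmp/coffee_shop.png")`.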
The output of a media-generating tool is also passed back to the model as a message, so the model has access to the generated media (image, audio, or video) and can use it in its response. For example, if you ask "Generate an image of a dog and tell me its color," the model can inspect the generated image and describe the dog's color in the same run.

This also means you can ask follow-up questions about the image, since it remains available in the agent's history.

## Developer Resources