Similar to providing multimodal inputs, you can also get multimodal outputs from an agent.
The following example demonstrates how to generate an image using an OpenAI tool with an agent.
from agno.agent import Agent
from agno.db.sqlite import SqliteDb
from agno.models.openai import OpenAIResponses
from agno.tools.openai import OpenAITools
from agno.utils.media import save_base64_data

agent = Agent(
    model=OpenAIResponses(id="gpt-5.2"),
    db=SqliteDb(db_file="tmp/test.db"),
    tools=[OpenAITools(image_model="gpt-image-1")],
    add_history_to_context=True,
    markdown=True,
)

response = agent.run(
    "Generate a photorealistic image of a cozy coffee shop interior",
)

if response.images and response.images[0].content:
    save_base64_data(str(response.images[0].content), "tmp/coffee_shop.png")
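The generated image arrives as base64-encoded data, which `save_base64_data` writes to disk for you. As a rough stdlib-only sketch of that step (an illustration of the idea, not Agno's actual implementation — `save_base64_to_file` is a hypothetical helper name):

```python
import base64
from pathlib import Path

def save_base64_to_file(data: str, path: str) -> None:
    # Decode the base64 string and write the raw bytes to disk.
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_bytes(base64.b64decode(data))

# Round-trip example with a tiny payload standing in for image bytes.
payload = base64.b64encode(b"\x89PNG fake image bytes").decode()
save_base64_to_file(payload, "tmp/example.bin")
print(Path("tmp/example.bin").read_bytes())  # b'\x89PNG fake image bytes'
```
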
The output of a tool that generates media is also passed back to the model as an input message, so the model has access to the media (image, audio, or video) and can use it in its response. For example, if you say “Generate an image of a dog and tell me its color.”, the model can see the generated image and describe the dog’s color in the same run. It also means you can ask follow-up questions about the image, since it remains available in the agent’s history.
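Conceptually, the message flow looks something like this (a simplified sketch of how tool-generated media re-enters the model's context, not Agno's actual internal message format):

```python
# Simplified sketch: each run appends messages, and with history enabled,
# later runs see everything that came before, including tool media output.
history = []

# Run 1: the user prompt and the tool's media output both become messages.
history.append({"role": "user", "content": "Generate an image of a dog and tell me its color."})
history.append({"role": "tool", "media": "image", "content": "<base64 image data>"})
# The model sees the image message above, so it can answer about the color.
history.append({"role": "assistant", "content": "Here is your dog. Its coat is golden."})

# Run 2: a follow-up question can reference the same image via history.
history.append({"role": "user", "content": "What breed does it look like?"})
context = list(history)  # everything above is sent to the model again
print(len(context))  # 4 messages available to the model
```
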