Learn how to use response caching to improve performance and reduce costs during development and testing.
When you are developing or testing new features, it is common to hit the model with the same query multiple times. In these cases you usually don't need the model to generate a fresh answer, so you can cache the response to save on tokens. Response caching stores model responses locally, avoiding repeated API calls and reducing costs when the same query is made more than once.
Response Caching vs. Prompt Caching: Response caching (covered here) caches the entire model response locally to avoid API calls. Prompt caching caches the system prompt on the model provider’s side to reduce processing time and costs.
Enable response caching by setting `cache_response=True` when initializing your model:
```python
from agno.agent import Agent
from agno.models.openai import OpenAIChat

agent = Agent(
    model=OpenAIChat(
        id="gpt-4o",
        cache_response=True,  # Enable response caching
    )
)

# First call - cache miss, calls the API
response = agent.run("What is the capital of France?")

# Second identical call - cache hit, returns cached response instantly
response = agent.run("What is the capital of France?")
```
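To confirm the cache is being hit, one option is to time the two calls. The sketch below assumes only the agent defined above and uses the standard library's `time.perf_counter`; the cached call should return almost instantly:

```python
import time

from agno.agent import Agent
from agno.models.openai import OpenAIChat

agent = Agent(model=OpenAIChat(id="gpt-4o", cache_response=True))

query = "What is the capital of France?"

# First call - cache miss, goes to the API
start = time.perf_counter()
agent.run(query)
print(f"Cache miss: {time.perf_counter() - start:.2f}s")

# Second identical call - cache hit, served locally
start = time.perf_counter()
agent.run(query)
print(f"Cache hit: {time.perf_counter() - start:.2f}s")
```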
Responses can also be cached when using streaming. On cache hits, the entire response is returned as one chunk.
```python
from agno.agent import Agent
from agno.models.openai import OpenAIChat

agent = Agent(model=OpenAIChat(id="gpt-4o", cache_response=True))

for i in range(1, 3):
    print(f"\n{'=' * 60}")
    print(f"Run {i}")
    print(f"{'=' * 60}\n")
    agent.print_response(
        "Write me a short story about a cat that can talk and solve problems.",
        stream=True,
    )
```
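If you want to observe the single-chunk behavior directly, you can count the chunks yielded by a streamed run. This is a minimal sketch, assuming `agent.run(..., stream=True)` returns an iterable of response chunks (counts may include run lifecycle events, depending on your version):

```python
from agno.agent import Agent
from agno.models.openai import OpenAIChat

agent = Agent(model=OpenAIChat(id="gpt-4o", cache_response=True))

prompt = "Write me a short story about a cat that can talk and solve problems."

for i in range(1, 3):
    # Run 1 streams many chunks; run 2 is a cache hit and should
    # return the whole response in a single chunk
    chunks = list(agent.run(prompt, stream=True))
    print(f"Run {i}: received {len(chunks)} chunk(s)")
```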