Prompt caching can help reduce processing time and costs. Consider using it if you send the same prompt multiple times in any flow.

You can read more about prompt caching with Anthropic models in Anthropic's documentation.

Usage

To use prompt caching in your Agno setup, pass the cache_system_prompt argument when initializing the Claude model:

from agno.agent import Agent
from agno.models.anthropic import Claude

agent = Agent(
    model=Claude(
        id="claude-3-5-sonnet-20241022",
        cache_system_prompt=True,
    ),
)
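With this flag set, Agno marks the system prompt for caching via Anthropic's cache_control mechanism, so repeated requests within the cache's time-to-live (5 minutes by default) can reuse the cached prompt instead of reprocessing it.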

Note that for prompt caching to work, the prompt must meet a minimum token length, which varies by model. You can read more about this in Anthropic's docs.
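If you are unsure whether your prompt is long enough, you can count its tokens first. Below is a minimal sketch using the Anthropic Python SDK's token-counting endpoint; the long_system_prompt variable is a placeholder for your own prompt:

import anthropic

client = anthropic.Anthropic()  # Reads ANTHROPIC_API_KEY from the environment

long_system_prompt = "..."  # Placeholder: your actual system prompt

# Count the input tokens the prompt would use, then compare the result
# against the minimum cacheable length for your model in Anthropic's docs.
count = client.messages.count_tokens(
    model="claude-3-5-sonnet-20241022",
    system=long_system_prompt,
    messages=[{"role": "user", "content": "Hello"}],
)
print(count.input_tokens)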

Extended cache

You can also use Anthropic's extended cache beta feature, which increases the cache duration from 5 minutes to 1 hour. To activate it, pass the extended_cache_time argument along with the following beta header:

from agno.agent import Agent
from agno.models.anthropic import Claude

agent = Agent(
    model=Claude(
        id="claude-3-5-sonnet-20241022",
        default_headers={"anthropic-beta": "extended-cache-ttl-2025-04-11"},
        cache_system_prompt=True,
        extended_cache_time=True,
    ),
)
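Both settings work together: the anthropic-beta header opts your requests into the extended-TTL beta, while extended_cache_time makes Agno request the 1-hour cache duration.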

Working example

cookbook/models/anthropic/prompt_caching_extended.py
from pathlib import Path

from agno.agent import Agent
from agno.models.anthropic import Claude
from agno.utils.media import download_file

# Load an example large system message from S3. A large prompt like this would benefit from caching.
txt_path = Path(__file__).parent.joinpath("system_promt.txt")
download_file(
    "https://agno-public.s3.amazonaws.com/prompts/system_promt.txt",
    str(txt_path),
)
system_message = txt_path.read_text()

agent = Agent(
    model=Claude(
        id="claude-sonnet-4-20250514",
        default_headers={"anthropic-beta": "extended-cache-ttl-2025-04-11"},  # Set the beta header to use the extended cache time
        cache_system_prompt=True,  # Activate prompt caching for Anthropic to cache the system prompt
        extended_cache_time=True,  # Extend the cache time from the default 5 minutes to 1 hour
    ),
    system_message=system_message,
    markdown=True,
)


# First run - this will create the cache
response = agent.run(
    "Explain the difference between REST and GraphQL APIs with examples"
)
print(f"First run cache write tokens = {response.metrics['cache_write_tokens']}")  # type: ignore

# Second run - this will use the cached system prompt
response = agent.run(
    "What are the key principles of clean code and how do I apply them in Python?"
)
print(f"Second run cache read tokens = {response.metrics['cached_tokens']}")  # type: ignore