As engineers working with large language models (LLMs), optimizing performance and cost is a constant challenge, especially when dealing with extensive context windows. Context caching emerges as a powerful solution, enabling the reuse of large prompts without incurring full costs on every request. This article provides a comparative analysis of how OpenAI, Anthropic, and Google Gemini implement context caching, focusing on the practical implications for your projects.
Quick Comparison
Here's an at-a-glance comparison of key features across the three providers:
OpenAI: Automatic caching with no code changes, a 50% discount on cached tokens, no write or storage fees, and a cache lifetime of roughly 5 to 10 minutes (up to one hour).
Anthropic: Opt-in caching via a cache_control parameter, a 90% discount on cached tokens, a 25% surcharge when the cache is written, free storage, and a 5-minute cache lifetime.
Google Gemini: Explicit cache creation and management via the API, a 75% discount on cached tokens, storage billed per token per hour, a customizable lifetime (TTL), and a 32,768-token minimum.
OpenAI: Convenience Over Cost
How It Works
Automatic Caching: OpenAI's context caching requires no additional parameters or API changes. It kicks in automatically once a prompt reaches 1,024 tokens and is integrated seamlessly into your existing workflow.
Pros
Ease of Use: No need to modify your code; caching happens behind the scenes.
No Extra Costs: Writing and storage are free, simplifying budgeting.
Cons
Lower Discount: Offers only a 50% discount on cached tokens—the lowest among the three.
Limited Storage Time: Cached prefixes are typically evicted after 5 to 10 minutes of inactivity, and always within one hour of last use.
Ideal For
Projects where simplicity and minimal setup are priorities.
Applications with short-lived sessions that don't require long-term caching.
Requests that use Prompt Caching report a cached_tokens value inside prompt_tokens_details, within the usage field of the API response:
usage: {
  total_tokens: 2306,
  prompt_tokens: 2006,
  completion_tokens: 300,
  prompt_tokens_details: {
    cached_tokens: 1920,
    audio_tokens: 0,
  },
  completion_tokens_details: {
    reasoning_tokens: 0,
    audio_tokens: 0,
  }
}
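For reference, here's a minimal sketch of how you might read that field with the OpenAI Python SDK; the model name and long_system_prompt are placeholders, and the system prompt is assumed to be at least 1,024 tokens, the threshold at which caching activates:
from openai import OpenAI

client = OpenAI()

# The long, stable prefix (system prompt, tool definitions, few-shot examples)
# is what gets cached; it is assumed here to be 1,024+ tokens.
long_system_prompt = "<your large, reusable system prompt>"

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": long_system_prompt},
        {"role": "user", "content": "Summarize the key points."},
    ],
)

# On a cache hit, cached_tokens reports how many prompt tokens were served
# from the cache at the discounted rate.
details = response.usage.prompt_tokens_details
print(f"prompt tokens: {response.usage.prompt_tokens}, cached: {details.cached_tokens}")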
Anthropic: Maximum Savings with Extra Steps
How It Works
Parameter-Based Caching: You mark the reusable parts of your prompt with a cache_control parameter in your API calls to enable caching.
Pros
High Discount: Enjoy a 90% discount on cached tokens, the highest available.
No Storage Cost: Storage is free, reducing ongoing expenses.
Cons
Writing Cost: An extra 25% charge on input tokens when creating the cache.
Implementation Effort: Requires code changes and cost-benefit calculations (see the quick break-even sketch after this list).
Short Storage Time: Cache lasts only 5 minutes, necessitating rapid reuse.
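To make the cost-benefit calculation concrete, here's a back-of-the-envelope sketch in relative prices, assuming cache writes cost 1.25x the normal input rate and cache reads 0.10x, per the figures above:
# Relative input-token cost for n requests that reuse the same large prefix.
def relative_cost(num_requests: int, cached: bool) -> float:
    if not cached:
        return num_requests * 1.00           # every request pays the full input price
    return 1.25 + (num_requests - 1) * 0.10  # first request writes the cache, the rest read it

for n in (1, 2, 5, 10):
    print(n, relative_cost(n, cached=False), relative_cost(n, cached=True))

# Caching costs more for a single request (1.25 vs 1.00) but already wins at
# two requests (1.35 vs 2.00), provided the reuse happens within the
# 5-minute cache window.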
Ideal For
Large-scale operations where significant savings justify the initial costs.
Use cases involving batch processing with minimal delays between requests.
Here’s an example of how to implement Prompt Caching using Anthropic models:
import anthropic

client = anthropic.Anthropic()

response = client.beta.prompt_caching.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are an AI assistant tasked with analyzing literary works. Your goal is to provide insightful commentary on themes, characters, and writing style.\n",
        },
        {
            # cache_control marks this large, reusable block as the prefix to cache.
            "type": "text",
            "text": "<the entire contents of 'My book'>",
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[{"role": "user", "content": "Analyze the major themes in 'My book'."}],
)
print(response)
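To verify that caching is working, check the usage object on the response: it reports cache_creation_input_tokens (tokens written to the cache) and cache_read_input_tokens (tokens served from it):
usage = response.usage
print(usage.cache_creation_input_tokens)  # tokens written to the cache (first call)
print(usage.cache_read_input_tokens)      # tokens read from the cache (subsequent calls)
print(usage.input_tokens)                 # uncached input tokens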
Google Gemini: Flexibility at a Price
How It Works
API-Based Caching: Create and manage caches via API calls, allowing for customization.
Pros
Customizable Storage: Set your own expiration times based on project needs.
Moderate Discount: Offers a 75% discount on cached tokens.
Supports Large Contexts: Minimum of 32,768 tokens, suitable for extensive data.
Cons
Added Implementation Steps: Instead of just changing a field on the API request, you need to initialize the cache with your desired configurations.
Storage Costs: Cached content is billed per token per hour for as long as it remains stored.
High Minimum Tokens: Not ideal for smaller context sizes.
Ideal For
Applications needing long-term caching and custom expiration.
Projects that can handle very large contexts and require fine-grained control.
import os
import google.generativeai as genai
from google.generativeai import caching
import datetime
import time

# Get your API key from https://aistudio.google.com/app/apikey
# and access your API key as an environment variable.
# To authenticate from a Colab, see
# https://github.com/google-gemini/cookbook/blob/main/quickstarts/Authentication.ipynb
genai.configure(api_key=os.environ['API_KEY'])

# Download video file
# curl -O https://storage.googleapis.com/generativeai-downloads/data/Sherlock_Jr_FullMovie.mp4
path_to_video_file = 'Sherlock_Jr_FullMovie.mp4'

# Upload the video using the Files API
video_file = genai.upload_file(path=path_to_video_file)

# Wait for the file to finish processing
while video_file.state.name == 'PROCESSING':
    print('Waiting for video to be processed.')
    time.sleep(2)
    video_file = genai.get_file(video_file.name)

print(f'Video processing complete: {video_file.uri}')

# Create a cache with a 5 minute TTL
cache = caching.CachedContent.create(
    model='models/gemini-1.5-flash-001',
    display_name='sherlock jr movie',  # used to identify the cache
    system_instruction=(
        'You are an expert video analyzer, and your job is to answer '
        'the user\'s query based on the video file you have access to.'
    ),
    contents=[video_file],
    ttl=datetime.timedelta(minutes=5),
)

# Construct a GenerativeModel which uses the created cache.
model = genai.GenerativeModel.from_cached_content(cached_content=cache)

# Query the model
response = model.generate_content([(
    'Introduce different characters in the movie by describing '
    'their personality, looks, and names. Also list the timestamps '
    'they were introduced for the first time.')])

print(response.usage_metadata)

# The output should look something like this:
#
# prompt_token_count: 696219
# cached_content_token_count: 696190
# candidates_token_count: 214
# total_token_count: 696433

print(response.text)
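Because the cache is an object you create and manage yourself, you can also list, extend, or delete it explicitly. Here's a short sketch using the same caching module (attribute names may vary slightly by SDK version):
# List existing caches for the project.
for c in caching.CachedContent.list():
    print(c.name, c.display_name, c.expire_time)

# Extend the TTL if the content is still needed (storage is billed per token per hour).
cache.update(ttl=datetime.timedelta(hours=2))

# Delete the cache as soon as you are done to stop storage charges.
cache.delete()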
Choosing the Right Provider for Your Project
Considerations for Engineers
Ease of Integration:
OpenAI is the best choice if you want a plug-and-play solution with minimal code changes.
Anthropic and Google Gemini require additional coding effort.
Cost Efficiency:
Anthropic offers the highest discount but includes an initial writing cost.
Google Gemini provides a good discount but adds storage costs.
OpenAI has the lowest discount but no extra costs.
Caching Duration:
Google Gemini allows for customizable cache durations.
OpenAI and Anthropic offer limited caching times.
Context Size:
Google Gemini is suitable for very large contexts (32,768+ tokens).
OpenAI and Anthropic have much lower minimum token requirements (on the order of 1,024 to 2,048 tokens, depending on the model).
Project Requirements:
For short-term projects or those requiring quick setup, OpenAI is preferable.
For cost-sensitive, large-scale projects, Anthropic may be the better option.
For projects needing customization and handling large data, Google Gemini is ideal.
Practical Scenarios
Chatbots with Extensive Prompts:
OpenAI simplifies deployment with automatic caching.
Anthropic can reduce costs significantly if you're willing to manage caching parameters.
Document Analysis:
Google Gemini excels with large documents due to its high token limit and customizable caching.
Final Thoughts
Selecting the right context caching provider depends on balancing ease of use, cost savings, and project requirements. OpenAI offers simplicity, Anthropic provides maximum discounts, and Google Gemini delivers flexibility.
As engineers, you should assess:
Your team's capacity to implement and manage caching features.
Budget constraints and how they align with potential savings.
Application needs, particularly regarding context size and caching duration.
By carefully considering these factors, you can choose the provider that best fits your project's needs, optimizing both performance and cost.