Bruno Alho

Comparing Context Caching in LLMs: OpenAI vs. Anthropic vs. Google Gemini

Updated: Nov 4


As engineers working with large language models (LLMs), optimizing performance and cost is a constant challenge, especially when dealing with extensive context windows. Context caching emerges as a powerful solution, enabling the reuse of large prompts without incurring full costs on every request. This article provides a comparative analysis of how OpenAI, Anthropic, and Google Gemini implement context caching, focusing on the practical implications for your projects.


Overview of the prompt/context caching approach

Quick Comparison

Here's an at-a-glance comparison of key features across the three providers:

Comparison of prompt caching details per provider:

  • OpenAI: automatic (no code changes), 50% discount on cached tokens, no write or storage cost, cache lifetime of 5–10 minutes of inactivity (up to one hour).

  • Anthropic: enabled with a cache_control parameter, 90% discount on cached reads, 25% surcharge when writing the cache, free storage, roughly a 5-minute cache lifetime.

  • Google Gemini: caches created and managed explicitly via the API, 75% discount on cached tokens, storage billed per token per hour, customizable expiration, 32,768-token minimum.

OpenAI: Convenience Over Cost


How It Works

  • Automatic Caching: OpenAI's context caching requires no additional parameters or API changes. It is applied automatically once a prompt's prefix reaches roughly 1,024 tokens, and it integrates seamlessly into your existing workflow.


Pros

  • Ease of Use: No need to modify your code; caching happens behind the scenes.

  • No Extra Costs: Writing and storage are free, simplifying budgeting.


Cons

  • Lower Discount: Offers only a 50% discount on cached tokens—the lowest among the three.

  • Limited Storage Time: Cached data is evicted after 5 to 10 minutes of inactivity, and always within one hour of its last use.


Ideal For

  • Projects where simplicity and minimal setup are priorities.

  • Applications with short-lived sessions that don't require long-term caching.
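For illustration, here's what a request might look like. The model name and prompt text below are placeholders; caching kicks in automatically once the shared prefix reaches the minimum cacheable length (around 1,024 tokens):

from openai import OpenAI

client = OpenAI()

# A long, repeated prefix (e.g. a large system prompt) is what gets cached.
# This placeholder string just pads the prefix past the minimum length.
long_shared_instructions = "You are a support assistant for Acme Corp. " + ("Policy detail. " * 500)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[
        # Keep the shared prefix identical across requests so later calls hit the cache
        {"role": "system", "content": long_shared_instructions},
        {"role": "user", "content": "Summarize the customer's latest ticket."},
    ],
)

# usage.prompt_tokens_details.cached_tokens shows how much of the prompt was served from cache
print(response.usage)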


Requests using Prompt Caching have a cached_tokens value within the usage field in the API response:

usage: {
  total_tokens: 2306,
  prompt_tokens: 2006,
  completion_tokens: 300,
  prompt_tokens_details: {
    cached_tokens: 1920,
    audio_tokens: 0
  },
  completion_tokens_details: {
    reasoning_tokens: 0,
    audio_tokens: 0
  }
}

 

Anthropic: Maximum Savings with Extra Steps


How It Works

  • Parameter-Based Caching: You mark the content blocks you want cached by adding a cache_control parameter to your API calls.


Pros

  • High Discount: Enjoy a 90% discount on cached tokens, the highest available.

  • No Storage Cost: Storage is free, reducing ongoing expenses.


Cons

  • Writing Cost: An extra 25% charge on input tokens when the cache is first written (cache writes cost 1.25× the base input rate).

  • Implementation Effort: Requires code changes and a quick cost-benefit calculation (see the break-even sketch after this list).

  • Short Storage Time: Cache entries expire 5 minutes after they were last read, so reuse has to happen quickly.
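As a rough break-even sketch using the numbers above (and ignoring output tokens): writing the cache costs 1.25× the base input price, and each subsequent cached read costs 0.1×. Two requests sharing the same prefix therefore cost roughly 1.25 + 0.1 = 1.35× the base price instead of 2.0× without caching, so the cache pays for itself on the second request, as long as that request arrives within the 5-minute window.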


Ideal For

  • Large-scale operations where significant savings justify the initial costs.

  • Use cases involving batch processing with minimal delays between requests.


Here’s an example of how to implement Prompt Caching using Anthropic models:


import anthropic

client = anthropic.Anthropic()

response = client.beta.prompt_caching.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are an AI assistant tasked with analyzing literary works. Your goal is to provide insightful commentary on themes, characters, and writing style.\n",
        },
        {
            # cache_control marks the end of the cached prefix: everything up to
            # and including this block (the full book text) is cached.
            "type": "text",
            "text": "<the entire contents of 'My book'>",
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[{"role": "user", "content": "Analyze the major themes in 'My book'."}],
)
print(response)
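To confirm the cache is actually being used, inspect the usage block on the response; Anthropic reports cache activity in fields such as cache_creation_input_tokens and cache_read_input_tokens:

print(response.usage)
# First request: cache_creation_input_tokens covers the book text being written to the cache.
# Repeat request within 5 minutes: cache_read_input_tokens covers it instead, billed at the discounted rate.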

 

Google Gemini: Flexibility at a Price


How It Works

  • API-Based Caching: Create and manage caches via API calls, allowing for customization.


Pros

  • Customizable Storage: Set your own expiration times based on project needs.

  • Moderate Discount: Offers a 75% discount on cached tokens.

  • Supports Large Contexts: Caches hold a minimum of 32,768 tokens, making them a fit for extensive data.


Cons

  • Added Implementation Steps: Instead of just changing a field on the request, you create and configure a cache object up front and reference it from later calls.

  • Storage Costs: Storage is billed per cached token per hour for as long as the cache exists.

  • High Minimum Tokens: Not ideal for smaller context sizes.


Ideal For

  • Applications needing long-term caching and custom expiration.

  • Projects that can handle very large contexts and require fine-grained control.


Here's an example of how to implement context caching with the Gemini API, using a large video file as the cached content:
import os
import google.generativeai as genai
from google.generativeai import caching
import datetime
import time

# Get your API key from https://aistudio.google.com/app/apikey
# and access your API key as an environment variable.
# To authenticate from a Colab, see
# https://github.com/google-gemini/cookbook/blob/main/quickstarts/Authentication.ipynb
genai.configure(api_key=os.environ['API_KEY'])

# Download video file
# curl -O https://storage.googleapis.com/generativeai-downloads/data/Sherlock_Jr_FullMovie.mp4

path_to_video_file = 'Sherlock_Jr_FullMovie.mp4'

# Upload the video using the Files API
video_file = genai.upload_file(path=path_to_video_file)

# Wait for the file to finish processing
while video_file.state.name == 'PROCESSING':
  print('Waiting for video to be processed.')
  time.sleep(2)
  video_file = genai.get_file(video_file.name)

print(f'Video processing complete: {video_file.uri}')

# Create a cache with a 5 minute TTL
cache = caching.CachedContent.create(
    model='models/gemini-1.5-flash-001',
    display_name='sherlock jr movie', # used to identify the cache
    system_instruction=(
        'You are an expert video analyzer, and your job is to answer '
        'the user\'s query based on the video file you have access to.'
    ),
    contents=[video_file],
    ttl=datetime.timedelta(minutes=5),
)

# Construct a GenerativeModel which uses the created cache.
model = genai.GenerativeModel.from_cached_content(cached_content=cache)

# Query the model
response = model.generate_content([(
    'Introduce different characters in the movie by describing '
    'their personality, looks, and names. Also list the timestamps '
    'they were introduced for the first time.')])

print(response.usage_metadata)

# The output should look something like this:
#
# prompt_token_count: 696219
# cached_content_token_count: 696190
# candidates_token_count: 214
# total_token_count: 696433

print(response.text)
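Because storage is billed per token per hour, it's usually worth managing cache lifetimes explicitly rather than letting caches sit until they expire. A minimal sketch, assuming the list, update, and delete helpers on caching.CachedContent behave as documented for this SDK:

# List existing caches for this project
for c in caching.CachedContent.list():
    print(c.name, c.expire_time)

# Extend the TTL if the cache will be reused for longer than planned
cache.update(ttl=datetime.timedelta(hours=2))

# Delete the cache as soon as it's no longer needed to stop storage charges
cache.delete()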


Choosing the Right Provider for Your Project


Considerations for Engineers

  1. Ease of Integration:

    • OpenAI is the best choice if you want a plug-and-play solution with minimal code changes.

    • Anthropic and Google Gemini require additional coding effort.

  2. Cost Efficiency:

    • Anthropic offers the highest discount but includes an initial writing cost.

    • Google Gemini provides a good discount but adds storage costs.

    • OpenAI has the lowest discount but no extra costs.

  3. Caching Duration:

    • Google Gemini allows for customizable cache durations.

    • OpenAI and Anthropic offer limited caching times.

  4. Context Size:

    • Google Gemini is suitable for very large contexts (32,768+ tokens).

    • OpenAI and Anthropic have much lower minimum token requirements (on the order of 1,024–2,048 tokens, depending on the model).

  5. Project Requirements:

    • For short-term projects or those requiring quick setup, OpenAI is preferable.

    • For cost-sensitive, large-scale projects, Anthropic may be the better option.

    • For projects needing customization and handling large data, Google Gemini is ideal.


Practical Scenarios


  • Chatbots with Extensive Prompts:

    • OpenAI simplifies deployment with automatic caching.

    • Anthropic can reduce costs significantly if you're willing to manage caching parameters.

  • Document Analysis:

    • Google Gemini excels with large documents due to its high token limit and customizable caching.


 


Final Thoughts


Selecting the right context caching provider depends on balancing ease of use, cost savings, and project requirements. OpenAI offers simplicity, Anthropic provides maximum discounts, and Google Gemini delivers flexibility.

As engineers, you should assess:

  • Your team's capacity to implement and manage caching features.

  • Budget constraints and how they align with potential savings.

  • Application needs, particularly regarding context size and caching duration.

By carefully considering these factors, you can choose the provider that best fits your project's needs, optimizing both performance and cost.
