As engineers working with large language models (LLMs), optimizing performance and cost is a constant challenge, especially when dealing with extensive context windows. Context caching emerges as a powerful solution, enabling the reuse of large prompts without incurring full costs on every request. This article provides a comparative analysis of how OpenAI, Anthropic, and Google Gemini implement context caching, focusing on the practical implications for your projects.
Quick Comparison
Here's an at-a-glance comparison of key features across the three providers:
OpenAI: Automatic caching with no code changes, a 50% discount on cached tokens, no write or storage fees, and a cache lifetime of roughly 5 to 10 minutes (up to one hour).
Anthropic: Opt-in caching via a cache_control parameter, a 90% discount on cached tokens, a 25% surcharge when the cache is written, free storage, and a 5-minute cache lifetime.
Google Gemini: Explicit cache creation and management via the API, a 75% discount on cached tokens, storage billed per token per hour, a customizable lifetime (TTL), and a 32,768-token minimum.
OpenAI: Convenience Over Cost
How It Works
Automatic Caching: OpenAI's context caching requires no additional parameters or API changes. It kicks in automatically once a prompt reaches 1,024 tokens and is integrated seamlessly into your existing workflow.
Pros
Ease of Use: No need to modify your code; caching happens behind the scenes.
No Extra Costs: Writing and storage are free, simplifying budgeting.
Cons
Lower Discount: Offers only a 50% discount on cached tokens—the lowest among the three.
Limited Storage Time: Cached prefixes are typically evicted after 5 to 10 minutes of inactivity, and always within one hour of last use.
Ideal For
Projects where simplicity and minimal setup are priorities.
Applications with short-lived sessions that don't require long-term caching.
Requests that use Prompt Caching report a cached_tokens value inside prompt_tokens_details, within the usage field of the API response:
usage: {
  total_tokens: 2306,
  prompt_tokens: 2006,
  completion_tokens: 300,
  prompt_tokens_details: {
    cached_tokens: 1920,
    audio_tokens: 0,
  },
  completion_tokens_details: {
    reasoning_tokens: 0,
    audio_tokens: 0,
  }
}
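For reference, here's a minimal sketch of how you might read that field with the OpenAI Python SDK; the model name and long_system_prompt are placeholders, and the system prompt is assumed to be at least 1,024 tokens, the threshold at which caching activates:
from openai import OpenAI

client = OpenAI()

# The long, stable prefix (system prompt, tool definitions, few-shot examples)
# is what gets cached; it is assumed here to be 1,024+ tokens.
long_system_prompt = "<your large, reusable system prompt>"

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": long_system_prompt},
        {"role": "user", "content": "Summarize the key points."},
    ],
)

# On a cache hit, cached_tokens reports how many prompt tokens were served
# from the cache at the discounted rate.
details = response.usage.prompt_tokens_details
print(f"prompt tokens: {response.usage.prompt_tokens}, cached: {details.cached_tokens}")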
Anthropic: Maximum Savings with Extra Steps
How It Works
Parameter-Based Caching: You mark the reusable parts of your prompt with a cache_control parameter in your API calls to enable caching.
Pros
High Discount: Enjoy a 90% discount on cached tokens, the highest available.
No Storage Cost: Storage is free, reducing ongoing expenses.
Cons
Writing Cost: An extra 25% charge on input tokens when creating the cache.
Implementation Effort: Requires code changes and cost-benefit calculations (see the quick break-even sketch after this list).
Short Storage Time: Cache lasts only 5 minutes, necessitating rapid reuse.
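To make the cost-benefit calculation concrete, here's a back-of-the-envelope sketch in relative prices, assuming cache writes cost 1.25x the normal input rate and cache reads 0.10x, per the figures above:
# Relative input-token cost for n requests that reuse the same large prefix.
def relative_cost(num_requests: int, cached: bool) -> float:
    if not cached:
        return num_requests * 1.00           # every request pays the full input price
    return 1.25 + (num_requests - 1) * 0.10  # first request writes the cache, the rest read it

for n in (1, 2, 5, 10):
    print(n, relative_cost(n, cached=False), relative_cost(n, cached=True))

# Caching costs more for a single request (1.25 vs 1.00) but already wins at
# two requests (1.35 vs 2.00), provided the reuse happens within the
# 5-minute cache window.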
Ideal For
Large-scale operations where significant savings justify the initial costs.
Use cases involving batch processing with minimal delays between requests.
Here’s an example of how to implement Prompt Caching using Anthropic models:
import anthropic

client = anthropic.Anthropic()

response = client.beta.prompt_caching.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are an AI assistant tasked with analyzing literary works. Your goal is to provide insightful commentary on themes, characters, and writing style.\n",
        },
        {
            # cache_control marks this large, reusable block as the prefix to cache.
            "type": "text",
            "text": "<the entire contents of 'My book'>",
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[{"role": "user", "content": "Analyze the major themes in 'My book'."}],
)
print(response)
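To verify that caching is working, check the usage object on the response: it reports cache_creation_input_tokens (tokens written to the cache) and cache_read_input_tokens (tokens served from it):
usage = response.usage
print(usage.cache_creation_input_tokens)  # tokens written to the cache (first call)
print(usage.cache_read_input_tokens)      # tokens read from the cache (subsequent calls)
print(usage.input_tokens)                 # uncached input tokens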
Google Gemini: Flexibility at a Price
How It Works
API-Based Caching: Create and manage caches via API calls, allowing for customization.
Pros
Customizable Storage: Set your own expiration times based on project needs.
Moderate Discount: Offers a 75% discount on cached tokens.
Supports Large Contexts: Minimum of 32,768 tokens, suitable for extensive data.
Cons
Added Implementation Steps: Instead of just changing a field on the API request, you need to initialize the cache with your desired configurations.
Storage Costs: Cached content is billed per token per hour for as long as it remains stored.
High Minimum Tokens: Not ideal for smaller context sizes.
Ideal For
Applications needing long-term caching and custom expiration.
Projects that can handle very large contexts and require fine-grained control.
import os
import google.generativeai as genai
from google.generativeai import caching
import datetime
import time

# Get your API key from https://aistudio.google.com/app/apikey
# and access your API key as an environment variable.
# To authenticate from a Colab, see
# https://github.com/google-gemini/cookbook/blob/main/quickstarts/Authentication.ipynb
genai.configure(api_key=os.environ['API_KEY'])

# Download video file
# curl -O https://storage.googleapis.com/generativeai-downloads/data/Sherlock_Jr_FullMovie.mp4
path_to_video_file = 'Sherlock_Jr_FullMovie.mp4'

# Upload the video using the Files API
video_file = genai.upload_file(path=path_to_video_file)

# Wait for the file to finish processing
while video_file.state.name == 'PROCESSING':
    print('Waiting for video to be processed.')
    time.sleep(2)
    video_file = genai.get_file(video_file.name)

print(f'Video processing complete: {video_file.uri}')

# Create a cache with a 5 minute TTL
cache = caching.CachedContent.create(
    model='models/gemini-1.5-flash-001',
    display_name='sherlock jr movie',  # used to identify the cache
    system_instruction=(
        'You are an expert video analyzer, and your job is to answer '
        'the user\'s query based on the video file you have access to.'
    ),
    contents=[video_file],
    ttl=datetime.timedelta(minutes=5),
)

# Construct a GenerativeModel which uses the created cache.
model = genai.GenerativeModel.from_cached_content(cached_content=cache)

# Query the model
response = model.generate_content([(
    'Introduce different characters in the movie by describing '
    'their personality, looks, and names. Also list the timestamps '
    'they were introduced for the first time.')])

print(response.usage_metadata)

# The output should look something like this:
#
# prompt_token_count: 696219
# cached_content_token_count: 696190
# candidates_token_count: 214
# total_token_count: 696433

print(response.text)
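Because the cache is an object you create and manage yourself, you can also list, extend, or delete it explicitly. Here's a short sketch using the same caching module (attribute names may vary slightly by SDK version):
# List existing caches for the project.
for c in caching.CachedContent.list():
    print(c.name, c.display_name, c.expire_time)

# Extend the TTL if the content is still needed (storage is billed per token per hour).
cache.update(ttl=datetime.timedelta(hours=2))

# Delete the cache as soon as you are done to stop storage charges.
cache.delete()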
Choosing the Right Provider for Your Project
Considerations for Engineers
Ease of Integration:
OpenAI is the best choice if you want a plug-and-play solution with minimal code changes.
Anthropic and Google Gemini require additional coding effort.
Cost Efficiency:
Anthropic offers the highest discount but includes an initial writing cost.
Google Gemini provides a good discount but adds storage costs.
OpenAI has the lowest discount but no extra costs.
Caching Duration:
Google Gemini allows for customizable cache durations.
OpenAI and Anthropic offer limited caching times.
Context Size:
Google Gemini is suitable for very large contexts (32,768+ tokens).
OpenAI and Anthropic have much lower minimum token requirements (on the order of 1,024 to 2,048 tokens, depending on the model).
Project Requirements:
For short-term projects or those requiring quick setup, OpenAI is preferable.
For cost-sensitive, large-scale projects, Anthropic may be the better option.
For projects needing customization and handling large data, Google Gemini is ideal.
Practical Scenarios
Chatbots with Extensive Prompts:
OpenAI simplifies deployment with automatic caching.
Anthropic can reduce costs significantly if you're willing to manage caching parameters.
Document Analysis:
Google Gemini excels with large documents due to its high token limit and customizable caching.
Final Thoughts
Selecting the right context caching provider depends on balancing ease of use, cost savings, and project requirements. OpenAI offers simplicity, Anthropic provides maximum discounts, and Google Gemini delivers flexibility.
As engineers, you should assess:
Your team's capacity to implement and manage caching features.
Budget constraints and how they align with potential savings.
Application needs, particularly regarding context size and caching duration.
By carefully considering these factors, you can choose the provider that best fits your project's needs, optimizing both performance and cost.