Large Language Models (LLMs) have been at the forefront of the generative AI revolution, especially since the emergence of ChatGPT. However, their full potential has yet to be unlocked, and one significant barrier is cost. The expense of incorporating LLMs into your applications can range from a few cents for on-demand use cases to upwards of $20,000 per month for hosting a single instance of an LLM in your cloud environment. Additionally, there are substantial costs associated with fine-tuning, training, vector search, and scaling.
In this blog post, I will explore the factors that contribute to the expense of LLM applications and break down the costs into major components. If you're interested in other costs associated with AI initiatives, including project running costs, please read about them here.
Breaking Down the Cost of AI and Large Language Models
When analyzing the cost of LLMs, it's useful to consider two main perspectives: the factors that can make LLMs more expensive and the individual cost components involved. Let's start with an overview of the factors.
Factors That Make LLMs and AI Models More Expensive
Model Complexity (Thought Effort): This refers to how sophisticated you want the model to be. Increasing the model's intelligence often involves using more complex architectures or larger models, such as scaling from 7 billion to 300 billion parameters or employing more experts simultaneously in a Mixture of Experts (MoE) model. Higher thought effort increases computational demands and costs. Obviously, smarter models are more expensive.
Input Size: The number of tokens you send to and receive from the model affects processing time and computational resources. Larger inputs and outputs require more power to process, thereby increasing costs. The more input and output you have, the more you pay.
Media Type: The type of media the model processes—be it text, audio, or video—impacts the cost. Processing audio and video typically demands more resources than text due to larger data sizes and complexity.
Latency Requirements: How quickly you need the response influences cost. Lower latency requires more computational resources or optimized infrastructure, which can be more expensive to maintain.
How Can You Pay for Running LLMs?
Typically, you'll find yourself paying for LLMs in production in one of two main ways:
Hosting Your Own Infrastructure: You can build and manage your own infrastructure to host LLMs, either on-premises or in the cloud. For instance, you might download a model like Llama 3 and run it on your servers. This approach offers control and customization but involves significant upfront investment and ongoing maintenance costs.
Model as a Service: Alternatively, you can use LLMs provided by AI vendors as a service. These vendors typically adopt a pay-per-token pricing model, where you are charged for each token sent to and received from the model. This approach can be more cost-effective for those who prefer not to handle the complexities of infrastructure management or are operating at a lower scale.
Before delving into cost estimations for some popular models, a word about training your own LLM.
The Cost of Creating LLMs
While most users won't train their own LLM from scratch, there have been reports that the cost of training LLMs such as BloombergGPT reaches millions of dollars, primarily due to GPU costs. Training LLMs today involves investing in research, acquiring and cleaning vast amounts of data, and paying for many hours of human feedback through techniques like Reinforcement Learning from Human Feedback (RLHF). While most companies that integrate LLMs into their generative AI applications use models trained by other organizations (like OpenAI's GPT-4 or Meta's Llama 3), they indirectly pay for the costs associated with creating these LLMs. Now that we've clarified this, let's look at a few cost examples of hosting your LLMs in the cloud.
Hosting an LLM in Your Cloud
When it comes to hosting your own model, the main cost, as mentioned, is hardware. Consider, for example, hosting an open-source model like Llama 3 on AWS. The default instance recommended by AWS is ml.p4d.24xlarge, with a listed price of almost $38 per hour (on-demand). Such a deployment would therefore cost roughly $27,360 per month ($38 x 24 hours x 30 days), assuming 24/7 operation, no scaling up or down, and no discounts.
Scaling this AWS deployment up and down may require attention, configuration changes, and optimization; even so, the costs will remain very high for such deployments.
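To make the hosting arithmetic above easy to rerun with your own numbers, here is a minimal Python sketch. The function name, the 30-day month, and the `utilization` parameter are my own assumptions for illustration; the $38 hourly rate is the rounded on-demand figure quoted above.

```python
HOURS_PER_DAY = 24
DAYS_PER_MONTH = 30  # assumption: a 30-day billing month


def monthly_hosting_cost(hourly_rate_usd: float,
                         instances: int = 1,
                         utilization: float = 1.0) -> float:
    """Estimate the monthly cost of always-on GPU instances.

    utilization is the fraction of the month the instances are running
    (1.0 means 24/7). Discounts and storage/network costs are ignored.
    """
    hours = HOURS_PER_DAY * DAYS_PER_MONTH * utilization
    return hourly_rate_usd * hours * instances


if __name__ == "__main__":
    # One ml.p4d.24xlarge at roughly $38/hour, running 24/7.
    print(f"Always on: ${monthly_hosting_cost(38.0):,.2f}")
    # Hypothetical scenario: the same instance running half of the time.
    print(f"50% utilization: ${monthly_hosting_cost(38.0, utilization=0.5):,.2f}")
```

Even at 50% utilization, which is a hypothetical scenario and not an AWS pricing tier, the bill stays in the five-figure range.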
Paying Per Token
An alternative to hosting an LLM and paying for the hardware is to use Software as a Service (SaaS) models and pay per token. Tokens are the units vendors use to price calls to their APIs. Different vendors, like OpenAI and Anthropic, have different tokenization methods, and they charge different prices per token depending on the model and on whether it's an input token or an output token.
For example, OpenAI charges $0.03 per 1,000 input tokens and $0.06 per 1,000 output tokens for GPT-4, while GPT-3.5 Turbo costs $0.0015 per 1,000 input tokens and $0.002 per 1,000 output tokens. Note that special characters and non-English text are often split into more tokens, which results in higher costs. If you are using other languages, such as Hebrew or Chinese, be aware that the costs may be even higher.
What Factors Can Influence LLM Call Costs?
Pay-by-token pricing for LLMs varies based on several factors, including model capabilities, usage patterns, and language differences. Here's how these factors influence costs, with precise examples and tables for popular models:
1. Model Selection and Capabilities
Choosing a model significantly impacts cost due to differences in performance and features.
Example: GPT-4 vs. GPT-3.5 Turbo
GPT-4: Advanced multimodal model with enhanced capabilities, 8K context length (or 32K in the extended version), and a knowledge cutoff of 2023. Ideal for complex tasks.
GPT-3.5 Turbo: Cost-efficient, faster, and suited for general applications with less complex requirements.
Pricing Comparison:
GPT-4: $0.03 per 1,000 input tokens, $0.06 per 1,000 output tokens.
GPT-3.5 Turbo: $0.0015 per 1,000 input tokens, $0.002 per 1,000 output tokens.
Impact on Cost:
Processing 1 million input tokens and 1 million output tokens:
GPT-4:
Input: (1,000,000 tokens / 1,000) * $0.03 = $30
Output: (1,000,000 tokens / 1,000) * $0.06 = $60
Total Cost: $90
GPT-3.5 Turbo:
Input: (1,000,000 tokens / 1,000) * $0.0015 = $1.50
Output: (1,000,000 tokens / 1,000) * $0.002 = $2.00
Total Cost: $3.50
Conclusion: GPT-4 offers superior capabilities at a higher cost, while GPT-3.5 Turbo provides affordability for less demanding tasks.
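The comparison above can be reproduced with a few lines of Python. This is only a sketch: the dictionary keys are labels I chose for illustration, and the prices are the per-1,000-token figures quoted in this post, which may be out of date by the time you read it.

```python
# Per-1,000-token prices in USD, as quoted in this post.
PRICES_PER_1K = {
    "gpt-4": {"input": 0.03, "output": 0.06},
    "gpt-3.5-turbo": {"input": 0.0015, "output": 0.002},
}


def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of a workload under pay-per-token pricing."""
    price = PRICES_PER_1K[model]
    input_cost = (input_tokens / 1000) * price["input"]
    output_cost = (output_tokens / 1000) * price["output"]
    return input_cost + output_cost


if __name__ == "__main__":
    for model in PRICES_PER_1K:
        cost = call_cost(model, input_tokens=1_000_000, output_tokens=1_000_000)
        print(f"{model}: ${cost:,.2f}")
```

Keeping input and output prices separate matters because output tokens are priced higher than input tokens for both models.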
2. Batch API Discounts
When an immediate response isn't required, the Batch API can reduce costs by 50% for tasks that can wait up to 24 hours.
Pricing with Batch API (GPT-4o Mini, per 1 million tokens): $0.075 for input, $0.300 for output.
Example:
Batch processing 2 million input tokens and 2 million output tokens with GPT-4o Mini:
Input: 2 x $0.075 = $0.150
Output: 2 x $0.300 = $0.600
Total Cost: $0.750
Conclusion: Batch API is cost-effective for non-urgent, large-scale tasks.
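Here is a small sketch of how the batch discount changes the bill. The 50% discount comes from the text above; the standard per-million rates of $0.15 and $0.60 are an assumption I derived by doubling the batch rates used in the example, and the function name is mine.

```python
BATCH_DISCOUNT = 0.50  # the 50% Batch API discount mentioned above


def batch_cost(input_tokens: int,
               output_tokens: int,
               standard_input_per_1m: float,
               standard_output_per_1m: float,
               discount: float = BATCH_DISCOUNT) -> float:
    """Return the USD cost of a workload after applying a batch discount."""
    standard = ((input_tokens / 1_000_000) * standard_input_per_1m
                + (output_tokens / 1_000_000) * standard_output_per_1m)
    return standard * (1 - discount)


if __name__ == "__main__":
    # Assumed standard rates: double the batch rates from the example.
    cost = batch_cost(2_000_000, 2_000_000,
                      standard_input_per_1m=0.15,
                      standard_output_per_1m=0.60)
    print(f"Batch cost: ${cost:.3f}")
```

Passing `discount=0.0` gives the on-demand price for the same workload, which makes the saving easy to compare side by side.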
3. Thought Process
Models like o1-preview can apply a thought process for complex reasoning, which points toward "Agent as a Service."
Features of o1-preview:
Enhanced reasoning for complex tasks.
Internal reasoning tokens included in output token count.
Pricing (o1-mini, per 1 million tokens): $3.00 for input, $12.00 for output.
Example:
Processing 500,000 input tokens and 500,000 output tokens with o1-mini:
Input: 0.5 x $3.00 = $1.50
Output: 0.5 x $12.00 = $6.00
Total Cost: $7.50
Conclusion: While more expensive, o1 models solve complex problems efficiently, reducing the need for multiple simpler model calls.
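Because internal reasoning tokens are counted as output tokens, the visible answer can understate the bill. The sketch below illustrates this with the o1-mini rates from the example; the 1,000,000 reasoning tokens are a made-up number for illustration, not a measurement, and the function name is mine.

```python
def reasoning_call_cost(input_tokens: int,
                        visible_output_tokens: int,
                        reasoning_tokens: int,
                        input_per_1m: float,
                        output_per_1m: float) -> float:
    """USD cost when hidden reasoning tokens are billed as output tokens."""
    billed_output = visible_output_tokens + reasoning_tokens
    return ((input_tokens / 1_000_000) * input_per_1m
            + (billed_output / 1_000_000) * output_per_1m)


if __name__ == "__main__":
    # Rates per 1M tokens taken from the o1-mini example above.
    without = reasoning_call_cost(500_000, 500_000, 0, 3.00, 12.00)
    # Hypothetical: the model also spends 1,000,000 tokens on reasoning.
    with_reasoning = reasoning_call_cost(500_000, 500_000, 1_000_000, 3.00, 12.00)
    print(f"No reasoning tokens:   ${without:.2f}")
    print(f"With reasoning tokens: ${with_reasoning:.2f}")
```

In practice you would read the actual reasoning token count from the usage data returned with each response rather than guess it.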
4. Media Type Processing
Processing different media types (text, images, audio) affects pricing due to varying computational demands.
Image Generation with DALL·E Models:
Example:
Generating 50 images at 1024×1024 resolution:
DALL·E 3: 50 x $0.040 = $2.00
DALL·E 2: 50 x $0.020 = $1.00
Conclusion: Higher-quality images cost more; select based on quality needs and budget.
While we've covered the differences in API pricing, this still doesn't address the true costs of LLMs in production. Let's explore the true cost of AI in depth.
Hidden costs of LLM applications
GPT-For-Work has developed a cross-platform pricing calculator for AI and LLM products. I utilized it to estimate the cost of an AI application that processes 1 million requests and was immediately faced with the question: How many tokens will be sent in such a case? The answer is complex, as this number is influenced by several hidden and unknown factors:
Variable Input and Output Sizes: The size of the user input and the generated output can vary significantly, affecting the number of tokens used.
Hidden Prompt Costs: There are hidden costs associated with application prompts. System prompts and instructions can add a significant number of tokens to each request.
Background API Calls: Utilizing agent libraries typically incurs additional API calls in the background to LLMs, in order to implement frameworks like ReAct or to summarize data for buffers.
These hidden costs are often the primary cause of bill shock when transitioning from the prototyping phase to production. Therefore, generating visibility into these costs is crucial.
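One way to get a first estimate of these hidden costs is to model a request explicitly. The sketch below is illustrative only: the class, its field names, and every token count in it are hypothetical values I picked to show the mechanics, not measurements from a real application.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class RequestProfile:
    """Hypothetical token profile of one user-facing request."""
    user_input_tokens: int
    output_tokens: int
    system_prompt_tokens: int = 0
    background_calls: int = 0            # e.g. agent steps or summaries
    background_input_tokens: int = 0     # per background call
    background_output_tokens: int = 0    # per background call


def billed_tokens(p: RequestProfile) -> Tuple[int, int]:
    """Return (input_tokens, output_tokens) actually billed per request."""
    input_tokens = (p.user_input_tokens + p.system_prompt_tokens
                    + p.background_calls * p.background_input_tokens)
    output_tokens = (p.output_tokens
                     + p.background_calls * p.background_output_tokens)
    return input_tokens, output_tokens


if __name__ == "__main__":
    naive = RequestProfile(user_input_tokens=200, output_tokens=300)
    realistic = RequestProfile(user_input_tokens=200, output_tokens=300,
                               system_prompt_tokens=500, background_calls=3,
                               background_input_tokens=800,
                               background_output_tokens=150)
    print("Naive estimate:    ", billed_tokens(naive))
    print("With hidden tokens:", billed_tokens(realistic))
```

With these made-up numbers, the billed tokens per request grow from 500 to 3,850, which is the kind of gap that shows up when a prototype moves to production.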
The Emergence of Vector Databases
Most of the previous discussion has focused on hosting LLMs and the exchange of data with an LLM. However, LLMs have proven useful not only for on-demand generation use cases but also for creating a new format for data storage known as embeddings. These embeddings are vectors (arrays of numbers) that can represent various media types such as text, images, audio, and video. Once data is compressed into vectors, it can be stored and indexed for advanced search purposes.
Weaviate, a leading vector database solution, has demonstrated the efficacy of storing and retrieving data in a vectorized form. What makes it special is the ability to search for media objects that are conceptually similar, such as emails that share the same tone or mention the same topics, even if the exact words are not used.
These new databases are significantly more expensive, as the creation and updating of embeddings are done through invoking LLMs. Additionally, searching the database requires more advanced and costly techniques.
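To give a feel for what "conceptually similar" search means, here is a toy Python sketch that ranks stored vectors by cosine similarity to a query vector. The three-dimensional vectors and document names are invented for illustration; real embeddings come from a model and have hundreds or thousands of dimensions. This brute-force scan is not how Weaviate works internally, since vector databases generally rely on approximate nearest-neighbor indexes.

```python
import math
from typing import Dict, List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query: List[float], index: Dict[str, List[float]],
           top_k: int = 2) -> List[str]:
    """Return the ids of the top_k stored vectors most similar to query."""
    scored = [(cosine_similarity(query, vec), doc_id)
              for doc_id, vec in index.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:top_k]]


if __name__ == "__main__":
    # Made-up embeddings: the first two emails are about billing problems.
    index = {
        "email_refund_complaint": [0.9, 0.1, 0.0],
        "email_angry_billing": [0.8, 0.3, 0.1],
        "email_lunch_invite": [0.0, 0.2, 0.9],
    }
    print(search([1.0, 0.2, 0.0], index))
```

Every stored vector and every query vector first has to be produced by an embedding model, which is where a large part of the cost comes from.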
How to control the cost of LLMs?
Having explored the components and factors that influence the costs of deploying LLMs and AI in production, particularly the underlying infrastructure, we can now turn our attention to cost reduction strategies. One effective approach is to optimize hardware performance. By choosing faster or more advanced GPUs, you can significantly increase inference speed, but this often comes at a higher cost. The central balance in LLMs is between performance and cost; it involves trade-offs between speed, cost, and accuracy. While some cost reductions may come from vendors offering improvements in hardware and algorithms, what techniques can you employ to squeeze more performance out of your existing infrastructure?
Choosing the Size of the LLM
The size of the LLM plays a crucial role in its performance and cost. Larger LLMs typically offer greater accuracy but at the expense of higher costs due to increased resource requirements. For example, upgrading from GPT-3.5 Turbo to GPT-4 can provide more accurate results but will also incur higher expenses. This decision requires careful consideration of the balance between accuracy and cost.
The following chart illustrates the impact of various prompt engineering techniques on the accuracy of LLM applications across different LLMs, ordered by size. While there is variance in performance and even some overlap between different LLMs that can be achieved through effective prompt engineering, the size of the LLM plays a critical role in determining accuracy. This implies that higher costs (since larger LLMs are more expensive) correlate with greater accuracy.
This table presents the pricing for various GPT models offered by OpenAI as of August 2023. It illustrates the significant differences in potential expenses. Later in this post, we will discuss strategies to ensure you select the most suitable LLM for your use case and budget.
Quantizing LLMs
Quantizing LLMs is a technique that reduces the precision of the model weights, leading to improved performance in terms of speed and resource utilization, with a trade-off of slightly reduced accuracy. This method can be particularly effective in managing costs while still maintaining acceptable performance for many applications. By quantizing your LLMs, you can achieve a more cost-effective balance between performance and accuracy.
Quantizing the models reduces their size significantly, resulting in decreased costs of hosting the LLM and improved latency. But at what accuracy compromise? For the selected benchmark, Llama2-13B quantized has shown better results than Llama2-7B, despite being almost 50% smaller in size. Read further in Miguel Neves' post.
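A rough back-of-the-envelope calculation shows why a quantized 13B model can end up smaller than an unquantized 7B one. The sketch below assumes 16-bit weights for the baseline and 4-bit weights after quantization; these bit widths are my assumption, since the benchmark's exact settings are not stated here. It counts weights only and ignores activations, the KV cache, and quantization metadata.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory needed to store model weights, in gigabytes."""
    bytes_total = num_params * bits_per_weight / 8
    return bytes_total / 1e9


if __name__ == "__main__":
    llama2_7b_fp16 = weight_memory_gb(7e9, 16)    # assumed 16-bit baseline
    llama2_13b_int4 = weight_memory_gb(13e9, 4)   # assumed 4-bit quantization
    print(f"Llama2-7B at 16 bits: {llama2_7b_fp16:.1f} GB")
    print(f"Llama2-13B at 4 bits: {llama2_13b_int4:.1f} GB")
    print(f"Relative size: {llama2_13b_int4 / llama2_7b_fp16:.0%}")
```

Under these assumptions the quantized 13B model needs roughly half the memory of the 7B model, in the same ballpark as the figure quoted above.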
Fine-Tuning LLMs
Fine-tuning LLMs for specific tasks can offer significant improvements in performance. If your LLM is required for a particular function, making a one-time investment in fine-tuning can enhance its effectiveness for that task. This approach allows for a more efficient use of resources by tailoring the model's capabilities to your specific needs, potentially reducing overall costs in the long run.
Constructing Better Prompts
System prompts are templates used in LLM applications to give instructions to the models, in addition to injecting specific data like user prompts. Crafting better system prompts can greatly improve the accuracy of LLMs and reduce instances of hallucination. Techniques such as "chain-of-thought" prompting can minimize errors by guiding the model through a more logical process of generating responses. However, this method may increase the amount of data sent to the model, thereby raising costs and potentially affecting performance. Optimizing prompt design is a crucial aspect of managing trade-offs between cost, accuracy, and efficiency.
To provide perspective, consider the reported "leaked" prompt of GitHub Copilot, the application that helps developers autocomplete code. Although not confirmed, the reported prompt contains 487 tokens, incurring significant out-of-the-box costs before introducing any user-specific context.
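To put a number on that overhead, the sketch below multiplies the system prompt size by the request volume and the input price. The 487 tokens come from the report above; the 1 million requests and the use of the GPT-4 and GPT-3.5 Turbo input prices quoted earlier in this post are assumptions for illustration, not Copilot's actual traffic or pricing.

```python
def system_prompt_overhead_usd(prompt_tokens: int,
                               requests: int,
                               input_price_per_1k: float) -> float:
    """USD spent on resending the same system prompt with every request."""
    total_tokens = prompt_tokens * requests
    return (total_tokens / 1000) * input_price_per_1k


if __name__ == "__main__":
    requests = 1_000_000  # hypothetical request volume
    gpt4 = system_prompt_overhead_usd(487, requests, input_price_per_1k=0.03)
    gpt35 = system_prompt_overhead_usd(487, requests, input_price_per_1k=0.0015)
    print(f"GPT-4 input pricing:         ${gpt4:,.2f}")
    print(f"GPT-3.5 Turbo input pricing: ${gpt35:,.2f}")
```

This covers only the fixed prompt; user input, retrieved context, and output tokens come on top of it.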
Using an analytic approach
While there is intuition and theory about how to use these techniques, it's often difficult to predict in advance which method will be more effective when optimizing LLM systems. Therefore, a practical solution is to adopt an analytical approach that allows you to track different scenarios and test them against your data.
Tools like LLMstudio, an open-source platform from TensorOps, facilitate exactly this. It enables you to test different prompts against various LLMs from any vendor and log the history of the requests and responses for later evaluation. Additionally, it tracks key metrics related to cost and latency, enabling you to make data-driven decisions regarding your LLM deployment optimization.
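Independently of any particular tool, the core idea of tracking cost and latency per scenario can be sketched in plain Python. To be clear, this is not LLMstudio's API: the `Tracker` class, the `fake_llm` stub, and the word-count stand-in for token counting are all hypothetical and exist only to show the shape of the approach.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# A model call returns (text, input_tokens, output_tokens).
LLMCall = Callable[[str], Tuple[str, int, int]]


@dataclass
class CallRecord:
    scenario: str
    latency_s: float
    cost_usd: float


@dataclass
class Tracker:
    records: List[CallRecord] = field(default_factory=list)

    def track(self, scenario: str, call: LLMCall, prompt: str,
              input_price_per_1k: float, output_price_per_1k: float) -> str:
        """Run one call, then log its latency and token-based cost."""
        start = time.perf_counter()
        text, input_tokens, output_tokens = call(prompt)
        latency_s = time.perf_counter() - start
        cost_usd = ((input_tokens / 1000) * input_price_per_1k
                    + (output_tokens / 1000) * output_price_per_1k)
        self.records.append(CallRecord(scenario, latency_s, cost_usd))
        return text

    def summary(self) -> Dict[str, Dict[str, float]]:
        """Aggregate call count, total cost, and total latency per scenario."""
        out: Dict[str, Dict[str, float]] = {}
        for r in self.records:
            s = out.setdefault(r.scenario,
                               {"calls": 0, "cost_usd": 0.0, "latency_s": 0.0})
            s["calls"] += 1
            s["cost_usd"] += r.cost_usd
            s["latency_s"] += r.latency_s
        return out


def fake_llm(prompt: str) -> Tuple[str, int, int]:
    """Stub model: counts words as a crude stand-in for input tokens."""
    return "ok", len(prompt.split()), 5


if __name__ == "__main__":
    tracker = Tracker()
    tracker.track("short-prompt", fake_llm, "Summarize this email politely",
                  input_price_per_1k=0.03, output_price_per_1k=0.06)
    print(tracker.summary())
```

Swapping `fake_llm` for a function that calls a real model and returns the token usage it reports would let you compare prompts and models on your own data.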
A word from your trusted advisors
As a consulting company, we have encountered several instances where LLM deployments failed because the unit economics were not viable. LLMs can be thought of as SpaceX's rockets that can return safely to Earth. Although it's groundbreaking technology, you wouldn't use it to go to the grocery store, right?
We assist companies in assessing the efficiency of their LLM deployment through a structured process that spans three weeks. If your company is interested in optimizing your LLM deployment, leave us a message on our website.