February 2024 marked a pivotal moment in the AI field as Google unveiled Gemini 1.5 Pro, boasting an impressive 1M tokens of context capacity. This advance, closely following OpenAI's "turbo" enhancement to GPT-4 and Anthropic's Claude 2.1, has reignited the discourse on the relevance of Retrieval-Augmented Generation (RAG) in an era dominated by Large Context Models. The introduction of Gemini 1.5 not only expanded the horizons of LLM capabilities but also redefined competitive paradigms, pitting Large Context against RAG. This post aims to explore the implications of Google's groundbreaking innovation, and how it stands in comparison with RAG.
TL;DR: Who Wins?
RAG systems provide flexibility and cost savings by retrieving specific information from external databases, whereas large context models like Gemini 1.5 Pro offer in-depth learning and reasoning on a given context, though with potentially higher costs and latency.
It is important to note that LLM applications span a vast range of use cases with different goals, so comparing the two approaches is not straightforward and should be done case by case. In the following table, we compile some heuristics to aid the distinction and choice between the two.
| | RAG | Large Context (Gemini 1.5 Pro) |
| --- | --- | --- |
| Typical Use Case | Chatbots with search engine capabilities; finding specific information in a large DB | Reasoning over complex scenarios that require big-picture understanding (e.g. a large codebase) |
| Context Understanding | Highly dependent on the data retrieved and the model chosen | Shows complex reasoning and in-context learning of skills |
| Cost | More cost-effective (fewer tokens) | Potentially higher cost due to per-token pricing |
| Latency | Potentially lower latency through targeted retrieval (fewer tokens); depends on the model | Well optimized: 1M tokens ≈ 1 min; for the same model, a larger context means higher latency |
| Flexibility & Customization | Greater flexibility with customized data sources and retrieval strategies; building and maintaining RAG systems can be complex and resource-intensive | Simplifies development, beneficial for teams unfamiliar with RAG; operates as a "black box", limiting customization |
| Security & Data Privacy | More control over security and privacy; not tied to a single provider | Potential challenges in ensuring data privacy and regulatory compliance when relying on an external provider |
We'll dive deeper into these differences in a later section, but before that, let's explore the context behind these two approaches.
Understanding the Context of LLMs
When we say "context" in LLMs, we're referring to how many words (or tokens) the model can handle in one go. This is a big deal because LLMs can't recall what was said in previous interactions, unlike humans or animals that remember past experiences.
To give LLMs a sort of memory, especially in applications such as ChatGPT, developers use a workaround: they create a temporary memory that holds onto the previous parts of the conversation. It's like keeping a notepad where the LLM scribbles down notes from the current session so it can refer back to them when needed.
However, there's a limit to how much the models can remember at once. If you're dealing with a really long chat or a large document, the LLM might struggle because it can only process so many words at a time.
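As a rough illustration of this "notepad" workaround, here is a minimal sliding-window chat memory in Python. The message format and the word-based token estimate are simplified assumptions for the sketch, not how any particular chat product is implemented:

```python
# Minimal sketch of a sliding-window chat memory (illustrative only).
# Keeps the most recent messages whose combined size fits a token budget,
# approximating tokens as whitespace-separated words for simplicity.

from typing import Dict, List

def rough_token_count(text: str) -> int:
    """Very rough token estimate; real systems use the model's tokenizer."""
    return len(text.split())

def trim_history(history: List[Dict[str, str]], max_tokens: int) -> List[Dict[str, str]]:
    """Keep the newest messages that fit inside the context budget."""
    kept: List[Dict[str, str]] = []
    used = 0
    for message in reversed(history):           # newest first
        cost = rough_token_count(message["content"])
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    return list(reversed(kept))                 # restore chronological order

history = [
    {"role": "user", "content": "Hi, can you summarize our last discussion?"},
    {"role": "assistant", "content": "We talked about context windows in LLMs."},
    {"role": "user", "content": "Great, continue from there."},
]
context = trim_history(history, max_tokens=2000)
```

Anything that falls outside the budget is simply forgotten, which is exactly the limitation that both RAG and larger context windows try to work around.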
The Drive for Innovative Design Patterns in LLM Application Development
The wide range of capabilities that LLMs offer has led to the emergence of innovative design patterns in LLM application development, including LLM REST APIs, Chains, Agents, RAG, and Large Context LLMs. These patterns reflect attempts to navigate the inherent limitations of LLMs, particularly those related to context size and diversity. By focusing on these areas, we get a front-row seat to the ongoing evolution of how LLMs are used, a testament to the creativity and innovation of developers as they come up with solutions to the challenges LLMs face.
Retrieval-Augmented Generation (RAG)
RAG combines the power of LLMs with efficient information retrieval techniques, such as vector or lexical search, to provide answers derived from existing knowledge bases or datasets. This approach is effective when the answer to a query exists within the data, allowing the relevant information to be retrieved and then synthesized by an LLM. It gives LLMs access to information that would not fit in their context window.
For example, the question "What was the interest rate decision of February 2024?" is one that an LLM most likely won't have the answer to in its knowledge base. With access to a large database of recent articles, one of which might contain a paragraph describing the event, it's possible to search for keywords, select only the relevant articles or even paragraphs, and feed them as context to the LLM, enabling it to provide the correct answer.
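A minimal sketch of this flow in Python, using a simple lexical-overlap score as the retrieval step; the documents, the `generate` placeholder, and all names here are illustrative, not a specific library's API:

```python
# Schematic RAG flow: lexical retrieval followed by generation.
# `generate` is a placeholder for an LLM call; no specific vendor API is implied.
import re

def tokenize(text: str) -> set[str]:
    return set(re.findall(r"\w+", text.lower()))

def lexical_score(query: str, doc: str) -> float:
    """Fraction of query terms that appear in the document."""
    q, d = tokenize(query), tokenize(doc)
    return len(q & d) / max(len(q), 1)

def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    """Return the k documents with the highest overlap with the query."""
    return sorted(docs, key=lambda doc: lexical_score(query, doc), reverse=True)[:k]

def generate(prompt: str) -> str:
    """Placeholder for an LLM call."""
    return f"(answer synthesized from a {len(prompt)}-character prompt)"

documents = [
    "The central bank held its February 2024 meeting and left the interest rate unchanged.",
    "A recipe for sourdough bread with a long fermentation.",
    "Quarterly earnings beat expectations across the tech sector.",
]

query = "What was the interest rate decision of February 2024?"
context = "\n".join(retrieve(query, documents, k=1))
answer = generate(f"Context:\n{context}\n\nQuestion: {query}")
```

Production systems typically swap the lexical score for embeddings in a vector database, but the overall shape (retrieve, then generate) stays the same.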
Large Context Models
Instead of relying on external data sources via information retrieval, another line of work puts significant effort into creating LLMs with bigger and bigger context windows, so that more information can fit inside them and be used directly to generate the desired responses.
The context window size of state-of-the-art models has been increasing, with OpenAI's GPT-4 Turbo offering a 128k-token context window and Anthropic offering 200k tokens with its Claude 2.1 and the just-released Claude 3 models.
But the biggest and most recent advance is coming from Google, which presented Gemini 1.5 Pro with a context window size that can go up to 1 million tokens in production.
This immense context window allows the model to ingest and process vast amounts of information, such as 700,000 words or codebases with over 30,000 lines of code, in a single prompt, unlocking a new set of use cases that were not possible before.
How did Google achieve this?
Gemini 1.5 Pro leverages a sparse mixture-of-experts (MoE), Transformer-based architecture, built upon the research advancements of its predecessor, Gemini 1.0. The MoE approach, using a learned routing function, allocates computation efficiently by activating only a relevant subset of parameters for each input, allowing the model to scale its parameter count without a proportional increase in computational demands. Key improvements across the model's architecture, data handling, optimization, and systems make Gemini 1.5 Pro powerful, efficient to train and serve, and capable of understanding long-context inputs. You can read further into mixture-of-experts in Miguel Neves' post.
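To make the routing idea concrete, here is a generic top-k gating sketch in NumPy. It is not Gemini's implementation (those details are not public); the dimensions, weights, and expert definitions are illustrative only:

```python
# Generic sketch of sparse mixture-of-experts routing (top-k gating).
# Illustrates activating only a subset of experts per token; NOT Gemini's
# actual architecture, whose internals are not publicly documented.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 8, 2

# Each "expert" here is just a small weight matrix; the router is learned in practice.
experts = [rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(n_experts)]
router = rng.standard_normal((d_model, n_experts)) * 0.1

def moe_layer(token: np.ndarray) -> np.ndarray:
    logits = token @ router                      # routing score per expert
    top = np.argsort(logits)[-top_k:]            # indices of the k best experts
    weights = np.exp(logits[top])
    weights /= weights.sum()                     # softmax over the selected experts only
    # Only the selected experts run, which is where the compute savings come from.
    return sum(w * (token @ experts[i]) for w, i in zip(weights, top))

out = moe_layer(rng.standard_normal(d_model))
```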
What does this mean for RAG?
A first thought one might have is that this context window revolution will slowly replace RAG. It raises the question of why we need an information retrieval step at all when new models are capable of handling large amounts of information on their own.
While it is true that Gemini 1.5 sparks this conversation, it cannot be declared the big winner without properly addressing how it compares in terms of latency, costs, precision, and other important success metrics.
Moreover, it’s important to note that LLMs have a lot of different purposes and each use case has its particularities that should be addressed when designing the system, so using a generic solution for everything is dangerous.
RAG vs. Large Context Models: The Differences
As explained at the beginning of the post, the choice between RAG and Large Context Models like Gemini 1.5 Pro depends on the specific requirements of the use case: cost, latency, context understanding, flexibility, and data privacy.
So, let's look more closely at how Gemini 1.5 compares to RAG across these different aspects.
Cost
One immediate drawback of using Gemini 1.5 is that API calls to the model are priced per token, so the more tokens given as context, the more costly the call. A RAG system, on the other hand, keeps the information retrieval step independent of the LLM and passes only the relevant content, which reduces the number of tokens per API call and therefore the cost. This makes it a more cost-effective solution for organizations of all sizes.
Exemplifying with the current Gemini API pricing, where 1,000 tokens cost approximately $0.0005, using the full 1M-token context would come to roughly $0.50 per call. With a RAG architecture, we would first retrieve the relevant context for the prompt and pass only that to the LLM; if, say, only 1% of the text were relevant, the call would cost around $0.005.
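As a quick sanity check, the same arithmetic in a few lines of Python; the rate is the illustrative one from the example above, and real prices should always be taken from the provider's pricing page:

```python
# Back-of-the-envelope cost comparison using the illustrative rate above.
# Real prices change; always check the provider's current pricing.
PRICE_PER_1K_TOKENS = 0.0005          # illustrative rate (USD)

def call_cost(prompt_tokens: int) -> float:
    return prompt_tokens / 1000 * PRICE_PER_1K_TOKENS

full_context_cost = call_cost(1_000_000)   # sending the entire 1M-token context
rag_cost = call_cost(10_000)               # sending only the ~1% retrieved as relevant

print(f"Full context: ${full_context_cost:.2f} per call")   # -> $0.50
print(f"RAG-style:    ${rag_cost:.4f} per call")             # -> $0.0050
```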
Latency
There aren't official latency numbers from Google yet, since the model is still in the early stages of release, but it is expected to be slower than when given a smaller context: in this demo, the model takes between 14 and 30 seconds to generate responses when given a 402-page document as context. For longer contexts nearing 1M tokens, Google claims approximately one minute per response. Although this is an impressive achievement, some use cases need lower latency, which can be achieved by pairing a more modest but faster model with RAG.
Context Learning
This is where Gemini truly shines: by being able to hold more information in its context, the model can show complex reasoning and in-context learning skills.
A notable demonstration of Gemini 1.5 Pro's context learning capabilities is its ability to reason about details from the 402-page Apollo 11 moon mission transcripts. This showcases the model's unique capability to handle inputs significantly longer than those manageable by its contemporaries.
Additionally, Gemini 1.5 Pro has shown proficiency in learning new skills from information in long prompts without additional fine-tuning.
Google showcased this by giving Gemini a grammar manual for Kalamang, a language with fewer than 200 speakers worldwide, and sentences for it to translate. The model was capable of translating English to Kalamang at a similar level to a person learning from the same content.
When given a prompt with over 100,000 lines of code, Gemini 1.5 Pro could reason across examples, suggest modifications, and explain different code parts, demonstrating its advanced problem-solving skills across longer code blocks.
This is something that cannot be achieved so simply with traditional RAG.
Long Context Recall
A recent study exploring the capabilities of LLMs on long input contexts, focusing on tasks that require identifying and using relevant information from within those contexts, shows an interesting finding: a distinctive U-shaped performance curve across various models.
Models perform best when the relevant information is placed at the beginning or end of the context, with performance declining when the information is located in the middle.
This study was conducted before the release of Gemini 1.5 and therefore does not include it in its research, but it raises awareness on how increasing context window size might not fully address the nuanced difficulties of long-context recall and utilization.
Also, in their technical report, Google shows how Gemini 1.5 Pro performs on the “Needle in a Haystack” test, which asks the model to retrieve a few given sentences from a provided text.
As can be seen in the image below, Gemini has higher recall at shorter context lengths and a small decrease in recall toward 1M tokens, where recall tends to be around 60%.
This means that around 40% of the relevant sentences are "lost" to the model. In scenarios where recall is key, it is best to curate the context first and send only the most relevant content, as traditional RAG does.
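For intuition, here is a minimal sketch of how such a recall test can be set up. The filler text, the exact-match check, and the `ask_model` placeholder are simplified assumptions, not the methodology Google used in its report:

```python
# Minimal sketch of a "needle in a haystack" recall test.
# `ask_model` is a placeholder for a real LLM call; the haystack construction
# and the exact-match evaluation are deliberately simplified.

def build_haystack(needle: str, n_filler: int, position: float) -> str:
    """Bury the needle sentence at a relative position inside filler text."""
    filler = ["The quick brown fox jumps over the lazy dog."] * n_filler
    idx = int(position * len(filler))            # 0.0 = start, 1.0 = end
    return " ".join(filler[:idx] + [needle] + filler[idx:])

def ask_model(prompt: str) -> str:
    """Placeholder: send `prompt` to the model under test and return its answer."""
    raise NotImplementedError

def recall_at_positions(needle: str, question: str, positions: list[float]) -> dict[float, bool]:
    """Check whether the model recovers the needle at each insertion depth."""
    results = {}
    for pos in positions:
        haystack = build_haystack(needle, n_filler=5000, position=pos)
        answer = ask_model(f"{haystack}\n\nQuestion: {question}")
        results[pos] = needle.lower() in answer.lower()   # crude exact-match check
    return results

# Example usage (requires wiring ask_model to a real model):
# recall_at_positions(
#     needle="The secret launch code is 4821.",
#     question="What is the secret launch code?",
#     positions=[0.0, 0.25, 0.5, 0.75, 1.0],
# )
```

Real evaluations also vary the total context length, which is how the recall-vs-context-size curves discussed above are produced.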
Flexibility and Customization
Relying only on the large context model makes the system something of a "black box" and reduces the number of ways to adjust it. RAG takes a different approach: it gives developers more control, allowing them to adjust data sources and retrieval strategies and to tweak the input given to the LLM as needed. This makes the workflow more customizable and transparent.
However, for teams less familiar with RAG or those that need a faster product delivery, using a large context model instead of building a RAG system can be a more practical solution. It might be less efficient, but it simplifies the development process.
Security and Data Privacy
RAG systems offer more control over security and data privacy, since they are not limited to a single provider's ecosystem. This minimizes data breach risks and boosts privacy. Using RAG with a self-hosted model gives developers control over data privacy, with the possibility of meeting strict security needs through permission-based access and of protecting sensitive information by not exposing it to external models.
On the other hand, using Gemini 1.5 means putting compliance with data protection regulations in the hands of an external provider. Google does put effort into ensuring data privacy and holds industry certifications such as SOC 2, but for use cases where data sovereignty and privacy are paramount, a thorough assessment should be made. Additionally, reliance on a single provider's ecosystem can limit the ability to tailor security measures to specific needs or to leverage data across multiple, possibly more secure or private, platforms.
Practical Example: The Harry Potter Challenge
RAG stands out in situations where the task involves uncovering answers hidden within extensive knowledge repositories. Take, for example, the task of extracting a particular detail from a famous story, like identifying the specific type of candy Harry Potter enjoyed on the Hogwarts Express. In such cases, RAG demonstrates its strength by quickly finding the relevant details and using the LLM to generate a precise and concise answer. This comes from efficiently parsing vast amounts of data, distilling the key information, and synthesizing it into an accurate response, all the while ensuring cost-effectiveness.
However, when faced with complex queries that demand a deep dive into extensive narratives or subjects, RAG's retrieval-based method may not be the best option. These are the times when Large Context Models become very important. They are good at going through and understanding a lot of information, which helps them answer questions that need a deep knowledge of full stories or complicated ideas. For instance, to explain the story developments from Hermione Granger's view in the "Harry Potter" series, you need to look at many books and lots of conversations. Here, RAG's method of finding and pulling out certain details isn't enough. Large Context Models can go through all of the text to create complete understandings.
The distinction between RAG and Large Context Models illuminates a fundamental tradeoff in LLM application development. While RAG offers a streamlined, focused method for answering queries that can be directly retrieved from existing datasets, it is less adept at navigating the complexities of broader, more nuanced queries. Large Context Models, on the other hand, excel in these elaborate explorations, at the cost of greater computational demands and potentially higher operational complexity.
Can They Complement Each Other?
Oriol Vinyals, VP of Research at Google DeepMind, emphasized the ongoing relevance of RAG despite advancements introduced by Gemini models to handle large amounts of context. He adds that combining RAG with long-context models might be an interesting way to push the boundaries of AI's capabilities. This approach aims to leverage the strengths of both technologies: the deep understanding and comprehensive processing power of long-context models, and the dynamic, up-to-date knowledge retrieval of RAG systems. Such integration could lead to AI outputs that are not only coherent and accurate over long contexts but also factually precise and relevant by pulling in the latest information.
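One way such a combination could look in practice, sketched with placeholder functions (nothing here corresponds to a specific product API): retrieve a generous set of candidate documents with RAG to keep the knowledge fresh and scoped, then let a long-context model reason over all of them in a single prompt.

```python
# Illustrative hybrid of RAG and a long-context model.
# `retrieve` and `generate_long_context` are placeholders, not a specific API.

def retrieve(query: str, top_k: int) -> list[str]:
    """Placeholder: return the top_k most relevant documents from a knowledge base."""
    raise NotImplementedError

def generate_long_context(prompt: str) -> str:
    """Placeholder: call a long-context model (e.g. one with a ~1M-token window)."""
    raise NotImplementedError

def hybrid_answer(query: str, top_k: int = 200) -> str:
    # The large window means we can afford to pass many more candidate documents
    # than a traditional RAG prompt would, while still filtering out the bulk
    # of the knowledge base before the model sees it.
    documents = retrieve(query, top_k=top_k)
    prompt = "\n\n".join(documents) + f"\n\nQuestion: {query}"
    return generate_long_context(prompt)
```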
Future
The landscape of RAG versus large context models will keep evolving over the coming months and years.
Google is already experimenting with expanding Gemini's context window to 10 million tokens, but with current hardware and model architectures this is not yet viable for production.
Although we haven't reached that point yet, as hardware evolves and research progresses, we can anticipate a reduction in latency and costs associated with running these advanced LLMs. This trend could potentially lead to a shift away from RAG applications for many use cases in the future. As LLMs become more efficient and capable of handling extensive contexts on their own, the reliance on RAG to supplement LLMs with external knowledge may decrease, signaling a significant evolution in how AI systems are designed and utilized for complex tasks.
Conclusion
The emergence of large context models like Google's Gemini 1.5 Pro represents a transformative moment in the realm of artificial intelligence, challenging traditional paradigms and pushing the boundaries of what's possible with LLMs.
While Retrieval-Augmented Generation (RAG) systems offer a pragmatic and cost-effective solution for queries that can be directly answered from existing knowledge bases, the large context models bring unparalleled depth and nuance to the analysis of complex questions.
Choosing between them boils down to the use case, reminding us that there's no one-size-fits-all answer in the world of AI applications. Each approach has its merits, and the decision hinges on the understanding of the problem being addressed. In some scenarios, RAG's efficiency and directness might be preferable, whereas in others, the comprehensive understanding and in-depth analysis achieved by large context models could be more useful.
As the landscape of LLMs continues to evolve, the interplay between these two methodologies will likely change. The key to navigating this lies in maintaining a flexible approach, open to leveraging the strengths of both RAG and large context models. By doing so, developers and researchers can ensure that they are using the most effective tools available, tailored to the unique challenges and opportunities of each project.