Gad Benram

Prompt Translation: The Way to Switch Between LLMs Without Losing Performance

Updated: Oct 2

Since the debut of ChatGPT in late 2022, the landscape of Large Language Models (LLMs) has evolved dramatically. Back then, the primary players were GPT-3 and BERT, but since then the offerings have exploded with models like Gemma, Gemini, Claude, and many more. Hugging Face alone hosts hundreds of thousands of models! Sometimes the motivation to change models is compulsory, like the GPT-3.5 deprecation in Q3 2024, which forced users to upgrade. Other times it's an optimization effort: perhaps you've received a good offer from a cloud vendor, or you need a model with a larger context window.


However, it turns out that switching between models isn't as trivial as simply changing the system prompts in your application. Users have reported, for example, that OpenAI models are much more elaborate and talkative than Gemini models. As the developer of an AI application, you've probably invested significant effort in tuning your prompts to achieve the behavior you want. In this blog post, I'll present the concept of "Prompt Translation," an analytics-driven method that uses benchmarks and generative AI to automatically adapt prompts from one model to another. I'll also cover how Google suggests training such a model with reinforcement learning, and the automatic prompt translator you can now use on Vertex AI.


Google's APD for automatic prompt translation

Automatically Fitting Prompts to Models

Prompt engineering has become a critical aspect of developing applications with LLMs. The way you craft your prompts can significantly impact the quality and relevance of the model's output. However, this process is often manual, involving a lot of trial and error. It can be time-consuming and may not yield the most effective prompts, especially when you're adapting to a new model with different behaviors.

Flow of Automatic Prompt Translator using Labeled Data

With the rapid evolution and proliferation of LLMs, the need for efficient and effective prompt engineering has become even more pressing. Not only do you need to create prompts for one model, but you also need to adapt them for new models that may interpret prompts differently. This raises a crucial question: Can we automate prompt engineering to keep up with the pace of LLM development?


PRewrite: Automating Prompt Optimization with Reinforcement Learning

To address these challenges, researchers have been exploring methods to automate prompt engineering. One promising approach is PRewrite: Prompt Rewriting with Reinforcement Learning, developed by a team from Google DeepMind and the University of Michigan.

PRewrite aims to automatically rewrite an initial, possibly sub-optimal prompt into a more effective one. It leverages the capabilities of LLMs themselves, using reinforcement learning (RL) to fine-tune a prompt rewriter model. This approach allows for the generation of improved prompts without manual intervention, making it easier to adapt to different models.

How to train an AI rewriter. Source: PRewrite: Prompt Rewriting with Reinforcement Learning

How Does PRewrite Work?

At a high level, PRewrite involves training a prompt rewriter—an LLM designed to generate better prompts. Here's the step-by-step process:

  1. Initial Prompt Input: You start with an initial prompt that may not be fully optimized.

  2. Prompt Rewriting: The prompt rewriter LLM takes this initial prompt and rewrites it. It does so by following a "meta prompt," which instructs it on how to perform the rewriting—such as rephrasing the instruction or adding specific requirements.

    Example Meta Prompt: "Rewrite the following instruction via rephrasing and/or adding specific requirements. Add instructions which would be helpful to solve the problem correctly. Output the new instruction only."

  3. Generating Task Outputs: The rewritten prompt is then fed into the task LLM—the model that performs the end task (e.g., answering a question, classifying text). The task LLM generates its output based on this improved prompt.

  4. Reinforcement Learning Optimization: The output from the task LLM is evaluated against a ground-truth answer using a reward function, which could be based on accuracy, F1 score, or any relevant metric. This reward is used to fine-tune the prompt rewriter LLM using reinforcement learning, specifically Proximal Policy Optimization (PPO).

  5. Iterative Improvement: This process is repeated, continuously improving the prompt rewriter's ability to generate effective prompts that enhance the task LLM's performance.
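The loop above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: `rewriter_llm` and `task_llm` are hypothetical stubs standing in for real model calls, and the PPO update itself is elided since it lives inside an RL training framework.

```python
# Minimal sketch of one PRewrite iteration with stubbed LLM calls.
# In a real system, rewriter_llm and task_llm would call actual models,
# and the reward would drive a PPO update of the rewriter's weights.

META_PROMPT = (
    "Rewrite the following instruction via rephrasing and/or adding "
    "specific requirements. Add instructions which would be helpful to "
    "solve the problem correctly. Output the new instruction only."
)

def rewriter_llm(meta_prompt: str, initial_prompt: str) -> str:
    # Stub: a real rewriter model would sample a rewrite conditioned
    # on the meta prompt; here we append a fixed improvement.
    return initial_prompt + " Show your reasoning step by step."

def task_llm(prompt: str, question: str) -> str:
    # Stub: a real task model would answer the question under the prompt.
    return "42" if "step by step" in prompt else "41"

def reward(prediction: str, ground_truth: str) -> float:
    # Exact-match reward; the paper also uses F1 and other metrics.
    return 1.0 if prediction == ground_truth else 0.0

initial_prompt = "Answer the arithmetic question."
rewritten = rewriter_llm(META_PROMPT, initial_prompt)
r = reward(task_llm(rewritten, "What is 6 * 7?"), "42")
# r would now be fed back to PPO to fine-tune the rewriter.
```

The key design point is that the reward is computed on the *task model's* output, so the rewriter is optimized for downstream performance rather than for any surface property of the prompt itself.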


Real-World Examples and Results

The researchers behind PRewrite tested it on several benchmark datasets, including:

  • AG News: A text classification dataset.

  • SST-2: A sentiment analysis dataset.

  • Natural Questions (NQ): A question-answering dataset.

  • GSM8K: An arithmetic reasoning dataset.

In these tests, PRewrite consistently improved the performance over the initial prompts. For instance, on the GSM8K dataset, which involves complex arithmetic problems, PRewrite increased the accuracy from 37.0% (with the initial prompt) to 83.8%, outperforming strong baselines.

Performance of Prompt Translation using RL on various datasets

Practical Implementation of Prompt Translation

Several platforms and tools are emerging to facilitate prompt translation. For instance, some cloud providers offer services that help automate the migration of prompts to their new models. These tools often integrate features like:

  • Instruction Optimization: Refining the task description in your prompts to align with the target model's expectations.

  • Demonstration Selection: Choosing the best few-shot examples to include in your prompts based on labeled examples.

  • Evaluation Metrics: Providing a range of metrics to assess performance, such as accuracy, BLEU scores, or custom-defined metrics.
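Demonstration selection, for instance, can be framed as a simple greedy search: repeatedly add the candidate example that most improves a metric on a labeled validation set. The sketch below assumes a hypothetical `evaluate` function standing in for a real model-backed metric.

```python
# Hedged sketch of greedy demonstration selection for few-shot prompts.
# evaluate() is a stub: a real version would build a prompt from the
# demos, call the target model on the validation set, and score it.

def evaluate(demos: list, validation: list) -> float:
    # Stub metric: reward demo sets that cover the labels seen
    # in the validation data.
    labels_covered = {d["output"] for d in demos}
    needed = {v["output"] for v in validation}
    return len(labels_covered & needed) / len(needed)

candidates = [
    {"input": "Great phone", "output": "positive"},
    {"input": "Awful battery", "output": "negative"},
    {"input": "Nice screen", "output": "positive"},
]
validation = [
    {"input": "I hate it", "output": "negative"},
    {"input": "Love it", "output": "positive"},
]

selected = []
for _ in range(2):  # budget of two demonstrations
    best = max(candidates, key=lambda c: evaluate(selected + [c], validation))
    selected.append(best)
    candidates.remove(best)
```

Greedy selection is not guaranteed to find the optimal subset, but it keeps the number of evaluation calls linear in the candidate pool, which matters when each call hits a paid model API.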



Steps to Migrate Prompts Efficiently

  1. Prepare Labeled Examples: Gather a set of prompts and their desired outputs that perform well on the source model.

  2. Set Optimization Parameters: Define your target model, optimization modes (e.g., instruction optimization, demonstration optimization), and evaluation metrics.

  3. Run the Optimization Tool: Use an automated tool or platform to optimize your prompts for the target model.

  4. Evaluate and Iterate: Assess the optimized prompts using your evaluation metrics and refine as necessary.
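Steps 1, 3, and 4 above can be wired together as a small evaluation harness: score each candidate prompt against the labeled examples on the target model and keep the winner. The sketch below uses a hypothetical `call_target_model` stub; in practice you would replace it with a real API call to your target model.

```python
# Hedged sketch: comparing a source prompt and an optimized prompt
# against labeled examples on a (stubbed) target model.

labeled_examples = [
    {"input": "I loved this movie!", "output": "positive"},
    {"input": "Terrible service.", "output": "negative"},
]

def call_target_model(prompt: str, text: str) -> str:
    # Stub for the target model; replace with a real API call.
    label = "positive" if "loved" in text else "negative"
    if "exactly one word" in prompt:
        return label
    return f"The sentiment is {label}."  # verbose answers fail exact match

def accuracy(prompt: str, examples: list) -> float:
    hits = sum(
        call_target_model(prompt, ex["input"]) == ex["output"]
        for ex in examples
    )
    return hits / len(examples)

source_prompt = "Classify the sentiment of the text."
optimized_prompt = (
    source_prompt + " Answer with exactly one word: positive or negative."
)

# Keep whichever prompt scores higher on the labeled set.
best = max([source_prompt, optimized_prompt],
           key=lambda p: accuracy(p, labeled_examples))
```

Because the metric is computed on the target model's actual outputs, this catches exactly the kind of behavioral drift described earlier, such as one model answering more verbosely than another.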


Best Practices for Prompt Translation

  • Understand Model Differences: Before migrating, research how the target model differs from the source in terms of behavior and capabilities.

  • Leverage Automation Tools: Utilize available tools and platforms that specialize in prompt optimization to save time and effort.

  • Define Clear Metrics: Establish what success looks like by selecting appropriate evaluation metrics for your task.

  • Iterate and Test: Optimization is an iterative process. Continuously test and refine your prompts based on performance feedback.

Vertex AI's APD Architecture

Conclusion

The rapid evolution of LLMs presents both challenges and opportunities for developers. Prompt Translation offers a systematic approach to adapt prompts across different models efficiently. By treating prompt migration as an optimization task and leveraging automated tools, developers can maintain high performance while staying agile in the ever-changing landscape of AI models.


Whether you're compelled to switch models due to deprecations or motivated by the promise of better performance, Prompt Translation can ease the transition and help you harness the full potential of new LLMs.


 

For developers seeking assistance with prompt migration, TensorOps specializes in LLM optimization and can provide support, including opportunities to test out new models under funded projects.



