Since OpenAI first introduced ChatGPT, the landscape of AI models has evolved significantly. While OpenAI now offers multiple versions of its models, each optimized for different use cases, with some being faster and others slower and there are many other AI vendors and models including almost 20,000 open source models on Hugging Face. With so many LLMs, the "one LLM fits all" design approach is no longer viable and engineers understand that to effectively manage LLM requests, they can choose one of two primary routing strategies: Mixture of Experts (MoE) and LLM LLM Proxy or LLM Gateway.
In this post, I’ll introduce how LLM Proxies or Gateways have a common design pattern in AI-driven applications. We’ll also explore how open-source tools can be used to create custom LLM proxies, Improving the CI/CD of AI models.
What is LLM Proxy
LLM Proxies and LLM Gateways are architectural solutions that abstract access to AI models, serving as intermediaries between your application and the LLM. These solutions not only assist with API integration but also leverage the advantages of traditional proxy servers, such as providing logging, monitoring, load balancing, and more. This architecture allows using the "right model" per task with little to non code change in the application level, avoiding a one-size-fits-all assumption.
Challenges that LLM proxy addresses
One of the primary challenges in deploying LLMs in production is the need to access multiple models from different providers, this design pattern is often referred to as "Multi LLM". As organizations transition from GPT-4o to GPT-4o-mini or explore alternatives like Anthropic, maintaining direct access to each LLM provider becomes cumbersome.
As the figure above shows, each vendor may have slightly different API, so when you want to change between models, you may need to deploy a new version of the code. You'll need to know their API and develop expertise in each one. This is, of course, difficult as there are many vendors, including open-source options. Ideally, you should address this by unifying the API access, allowing quick switching between models without modifying the application code.
This change can be also managed with LangChain however, as of today LangChain implements this in the application layer and LLM proxies allows you to apply the configuration across many applications from one place.
Here's an example to how you'd make an LLM vendor configurable by code with LLMstudio:
from llmstudio import LLM
model =Â LLM("anthropic/claude-2.1")
model.chat("What are Large Language Models?")
Monitoring and Measuring LLM Applications
Besides just unifying AI access, making your code more generic and customizable to changes in the AI engine is crucial. Very often, you can think of AI application code as a set of wrapper functions around the engine and AI model that you can replace. Basically, the same LangChain code can be used to build a chatbot (like in this example), and then you can change the model by configuration to make your chatbot smarter or faster. This should be as easy as flipping a switch.
To know which models to choose, you will need to run optimizations, and for that, effective monitoring is essential. LLM Proxies play a crucial role in this aspect by providing tools to log and monitor interactions with the models. This includes tracking latency, token usage, and response times, enabling organizations to optimize their applications based on performance metrics.
LLM Proxies in Action: LiteLLM and LLMstudio
Proxies are a core component of LLM Proxies, acting as intermediaries between clients and LLM providers. They enable organizations to route requests to different models and handle failover scenarios seamlessly. For example, LiteLLM, a Y-Combinator-funded project, or LLMstudio. simplify LLM API calls by providing a unified interface to interact with over 100 LLMs. This approach reduces the need for code changes and supports dynamic switching between models.
Open Source Solutions and Custom Integrations
LLMstudio shows how organizations can leverage LLM Proxies to provide production grade features on top of AI and LLM APIs. LLMstudio provides a UI for integrating and testing different LLMs, supporting custom extensions and enabling comprehensive logging and monitoring. This flexibility allows organizations to evaluate various models and prompts, optimizing for their specific use cases. In this blog post Diogo Azevedo shows how to deploy an LLM Proxy on top of Google Cloud Kubernetes engine and then make LLM calls to it from an app.
Smart Routing: Proxy or MoE?
As mentioned, one of the key benefits you'd expect from an LLM gateway is its ability to manage access to LLM APIs intelligently. For example, if a user requests a translation from English to Spanish, a simple LLM router can identify that the request is better suited for a specialized model fine-tuned for translation, avoiding the need for an expensive call to a foundation model like GPT-4. Hugging Face introduced this concept as a "model router" in "Hugging GPT" which could be ideally implemented once at the organizational level rather than for every individual application. However, this type of routing may be less efficient than a competing technique like Mixture of Experts (MoE), where the routing layer is integrated as part of the LLM itself.
Scalable LLM Proxy Deployments
So far, I have only mentioned the benefits of introducing centralization to the LLM APIs, but such an approach can also have downsides. To handle the increased traffic and prevent bottlenecks, LLM Proxy must be scalable. Containerized proxies can be deployed in Kubernetes clusters, enabling horizontal scaling based on demand. This ensures that the proxy server can manage large volumes of requests without compromising performance or reliability. The diagram below shows the general architecture of a scalable LLM proxy on top of K8s.
Adding Extra Security, Privacy, and Compliance
After centralizing access to LLMs and adding logging and monitoring, LLM gateways can enhance security by centralizing access control, secret management, and logging. They also facilitate compliance with data privacy regulations by allowing organizations to mask sensitive information before sending requests to LLM providers. Some solutions, like opaque.co, use small internal LLMs or neural networks to identify and redact Personally Identifiable Information (PII), ensuring that sensitive data remains within the organization.
Conclusion
LLM proxies are an emerging design pattern in AI applications, helping organizations manage access, security, and monitoring of LLM applications in production. By centralizing access, enabling network-level switching between models, and providing logging and monitoring tools, organizations can gain more granular control over their AI workloads while also reducing the integration efforts required for new LLMs. Logging and monitoring also assist in maintaining control over costs and compliance. As the use of LLMs continues to grow, the role of LLM proxies will become increasingly important in ensuring efficient and secure deployment. However, some question whether this design pattern will be as efficient compared to a model based on the Mixture of Experts approach.