As the CTO of TensorOps, and previously as a consultant for one of the largest cloud MSPs, I've had the privilege of working with a variety of serverless computing platforms. AWS Lambda has been a staple of serverless computing, but its limitations become apparent under demanding AI workloads.
Our customers have increasingly complained about the startup times of AI workloads: with SageMaker batch inference, for example, provisioning alone can take up to 15 minutes, rendering the platform useless when an entire pipeline must finish in under 30 minutes. And as pipelines have grown more complex, deploying workloads to the cloud has become harder, leaving engineers with limited options. A common oversight is underestimating how much the image a task runs on affects the provisioning of new resources.
Recently, our customers have become more demanding when it comes to serving AI workloads, both for batch and online inference. That's why we decided to adopt Modal for our future use cases.
Overcoming Serverless Limitations with Modal
Serverless platforms like AWS Lambda impose stringent constraints: functions are capped at 15-minute runs, 50 MB (zipped) deployment packages, and 10 GB of memory with up to 6 vCPUs. Such limits can hinder modern AI applications, which often need extensive compute and longer execution times. Modal, by contrast, allows functions to use up to 64 CPUs, 336 GB of memory, and 8 Nvidia H100 GPUs, a leap in resources that lets even the most compute-intensive AI tasks run efficiently.
The Modal Architecture: A Deep Dive
Modal's architecture is designed to handle the complexities of modern AI workloads seamlessly. At its core, Modal translates web requests into function calls, bypassing the need for traditional REST APIs. This approach leverages a high-performance infrastructure optimized for both HTTP and WebSocket requests.
In this example taken from Modal's blog post, they show how you can deploy a simple web endpoint:
from modal import App, web_endpoint

app = App(name="small-app")

@app.function()
@web_endpoint(method="GET")
def my_handler():
    return {
        "status": "success",
        "data": "Hello, world!",
    }
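Assuming the file above is saved as small_app.py (an illustrative name), deploying it is a single CLI call, and modal serve gives you a hot-reloading development endpoint while you iterate:

modal deploy small_app.py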
They claim that deploying this simple function takes about 0.75 seconds, significantly faster than the equivalent services we're used to. That matters: we've seen even simple AI workloads wait minutes, sometimes 15-20, just to start, so cutting deployment to under a second can have a real impact on production systems.
Modal's Autoscaling Architecture Explained
Modal's autoscaling mechanism is architecturally distinct from traditional serverless platforms, which typically target lightweight, short-duration tasks. Modal supports heavy-duty computation by dynamically adjusting resource allocation to actual workload demand, much as an operating system schedules processes. This lets it handle significant computational loads, such as model training or large data pipelines, without the constraints typically imposed by serverless environments, like AWS Lambda's 15-minute execution limit and 10 GB memory cap.
As mentioned, Modal can allocate far more resources per container, and it monitors metrics such as request latency, CPU load, and memory usage to scale efficiently. Avoiding idle time is critical given these high resource limits, which is why Modal scales to zero and bills by the second, ensuring resources are used economically.
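To make this concrete, here is a minimal sketch of the autoscaler knobs as Modal documents them (parameter names may differ across client versions, and the handler itself is a placeholder):

from modal import App

app = App(name="scaling-demo")

# keep_warm holds a floor of warm containers to avoid cold starts,
# concurrency_limit caps the fan-out, and container_idle_timeout
# controls how quickly idle containers are reclaimed on the way
# back down to zero.
@app.function(
    keep_warm=1,
    concurrency_limit=20,
    container_idle_timeout=60,
)
def handle_request(payload: dict) -> dict:
    # Placeholder for a stateless unit of work.
    return {"ok": True, "size": len(payload)}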
However, such a robust scaling mechanism comes with complexities. Managing the spin-up and cool-down of high-capacity containers requires precise coordination and makes it hard to maintain persistent state across scale-downs. This makes Modal better suited to stateless operations, or to applications whose state can be externalized without hurting performance.
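One way to externalize that state is through Modal's own distributed primitives. Here is a minimal sketch using modal.Dict, a named key-value store that outlives any single container (the key names and function are illustrative):

from modal import App, Dict

app = App(name="stateful-demo")

# A durable, named key-value store shared across containers.
state = Dict.from_name("request-counter", create_if_missing=True)

@app.function()
def count_request(user_id: str) -> int:
    # Containers come and go during scale-downs; the Dict persists.
    # (Read-modify-write shown for brevity; not atomic under concurrency.)
    current = state.get(user_id, 0) + 1
    state[user_id] = current
    return current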
Main Capabilities of Modal
Modal stands out in the serverless computing landscape due to its impressive range of capabilities tailored for high-demand AI workloads. Here’s a brief overview:
💪 High Resource Limits: Leverage up to 64 CPUs, 336 GB of memory, and 8 Nvidia H100 GPUs for compute-intensive tasks like training neural networks and rendering graphics.
📈 Efficient Autoscaling: Dynamically adjust containers based on workload, ensuring efficient resource use and cost control by scaling up during peak times and down when demand is lower.
🔄 Real-Time WebSocket Support: Facilitate real-time bidirectional communication, perfect for live updates and interactive AI applications.
🚀 Easy Integration: Simplify web endpoint setup and function invocation without extensive REST API configurations, speeding up development and deployment.
⚡ Streamlined Deployment: Deploy functions with a few lines of Python code, reducing development cycles and infrastructure management overhead.
🛠️ Robust Infrastructure: Built with Rust, ensuring high performance and reliability, efficiently handling large request bodies and streaming responses.
📅 Flexible Scheduling: Support for cron-like scheduling patterns and custom code, ideal for bespoke ETL jobs and periodic tasks without the complexity of traditional orchestration tools (a sketch follows this list).
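As promised above, a minimal scheduling sketch using Modal's Cron and Period schedule types; the function bodies are placeholders:

from modal import App, Cron, Period

app = App(name="scheduled-etl")

# Run every day at 07:00 UTC, using standard cron syntax.
@app.function(schedule=Cron("0 7 * * *"))
def daily_etl():
    ...  # placeholder: extract, transform, load

# Or run on a fixed interval, here every 4 hours.
@app.function(schedule=Period(hours=4))
def refresh_cache():
    ...  # placeholder: periodic refresh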
Getting Started with Modal
I showed a simple "hello world" example with Modal earlier. That function takes less than a second to deploy and serves as a basic introduction to Modal's capabilities. For more complex workloads, such as data-intensive video processing, Modal scales just as easily:
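A minimal sketch of such a resource request might look like this; the cpu, memory (in MiB), and gpu parameters are Modal's, while the function and its body are illustrative placeholders:

from modal import App

app = App(name="video-pipeline")

# Request a larger container: 8 CPUs, 32 GiB of memory, and one GPU.
@app.function(cpu=8.0, memory=32768, gpu="H100")
def process_video(video_bytes: bytes) -> bytes:
    ...  # placeholder: decode, transform, and re-encode the video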
This example lets you flexibly choose a container with more memory and CPU power when needed, and you can just as easily add GPUs, among other resources.
I appreciate how closely the infrastructure is coupled with the application code, although there are downsides. In more complex systems, resources are commonly defined in CI/CD configuration rather than within the application logic itself, so managing this coupling at larger scale may require more effort.
Conclusion
Modal presents a robust alternative to AWS Lambda, particularly for AI workloads that demand high computational power and flexibility. Its innovative architecture, combined with seamless scalability and advanced resource management, makes it an ideal choice for modern AI applications. At TensorOps, we are excited to integrate Modal into our solutions, providing our clients with the cutting-edge infrastructure necessary to thrive in today's fast-paced technological landscape.
For those looking to explore Modal further, the platform's documentation and support community are excellent resources to get you started on your journey toward efficient and scalable AI deployment.