Modal: Serverless GPU Computing for AI & ML
Deploy LLMs, train models, and scale batch jobs with sub-second cold starts. Python-first serverless infrastructure for AI teams—no Kubernetes, no YAML.
Modal is a serverless compute platform purpose-built for AI and ML teams. It enables developers to run GPU-accelerated workloads—LLM inference, model fine-tuning, batch processing—without managing infrastructure. Define everything in Python using decorators, and Modal handles containerization, scaling, and orchestration. Features include sub-second cold starts, multi-cloud GPU capacity, memory snapshots for faster deployments, and integrated observability. Founded in 2021, Modal has raised $111M and is trusted by enterprises like Ramp, Substack, and Suno for production AI workloads.
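The decorator-driven workflow described above can be sketched roughly as follows. The app name, image packages, and function body here are hypothetical, and actually running it requires the `modal` package and a Modal account:

```python
import modal

# Hypothetical example app; the name and packages are illustrative.
app = modal.App("llm-inference")

# The container image is declared in Python, not a Dockerfile.
image = modal.Image.debian_slim().pip_install("transformers", "torch")

@app.function(gpu="H100", image=image, timeout=600)
def generate(prompt: str) -> str:
    # Real code would load a model and run inference here.
    return f"completion for: {prompt}"

@app.local_entrypoint()
def main():
    # `modal run app.py` executes this locally and runs generate() in the cloud.
    print(generate.remote("Hello, Modal!"))
```

Modal infers the container spec, scaling policy, and hardware from the decorator arguments, so there is no separate deployment manifest to maintain.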
Pricing
Free Starter tier with $30/month in compute credits. Team plan at $250/month (includes $100/month in credits, unlimited seats, 1000 containers, 50 GPU concurrency). Enterprise plan with custom pricing for volume-based discounts and higher concurrency. On top of subscription fees, usage is pay-as-you-go: CPU at $0.0000131/core/sec (min 0.125 cores), Memory at $0.00000222/GiB/sec, GPUs ranging from $0.000164/sec (T4) to $0.001097/sec (H100). Multipliers apply for regional selection and non-preemptible sandboxes.
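As a back-of-the-envelope illustration of the per-second billing model, the snippet below estimates the cost of a single job from the rates quoted above. The job shape (90 seconds, 2 cores, 16 GiB, one H100) is a made-up example, and actual invoices may differ:

```python
# Per-second rates as listed on this page (assumptions, not an official quote).
CPU_PER_CORE_SEC = 0.0000131
MEM_PER_GIB_SEC = 0.00000222
H100_PER_SEC = 0.001097

def job_cost(seconds: float, cores: float = 0.125, gib: float = 1.0,
             gpu_rate: float = 0.0) -> float:
    """Estimate pay-as-you-go cost: CPU + memory + optional GPU, billed per second."""
    return seconds * (cores * CPU_PER_CORE_SEC + gib * MEM_PER_GIB_SEC + gpu_rate)

# A hypothetical 90-second H100 inference call with 2 cores and 16 GiB of memory:
cost = job_cost(90, cores=2, gib=16, gpu_rate=H100_PER_SEC)
print(f"${cost:.4f}")  # roughly $0.10 for the whole call
```

Because billing is per second with scale-to-zero, idle time costs nothing; only the seconds a container is actually running are charged.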
Frequently Asked Questions
How does Modal compare to AWS Lambda?
Modal is purpose-built for ML workloads and offers sub-second cold starts, direct GPU access, and better support for long-running jobs and large model deployments. Lambda caps execution at 15 minutes, limits zipped deployment packages to 50MB, and offers no GPUs. Modal is ideal for AI/ML inference and training; Lambda suits event-driven glue code.
What programming languages does Modal support?
Modal is Python-first, with a comprehensive Python SDK. JavaScript/TypeScript and Go SDKs are available but less mature. The platform is optimized for Python workflows, though you can containerize code in other languages if needed.
Can I use Modal for 24/7 always-on services?
Modal is designed for serverless, scale-to-zero workloads. While you can keep instances warm, it's not optimized for always-on services. Traditional containers or platforms like Kubernetes are better for long-running, always-available applications.
How are cold starts minimized on Modal?
Modal combines three techniques: a custom container runtime built from scratch (gVisor-based rather than runc/Docker), VolumeFS for fast model loading, and memory snapshots that capture container state after initialization and restore it on subsequent starts. Together these can improve cold start times by up to 10x.
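Assuming current Modal APIs, memory snapshotting is opted into per function via a decorator flag. The app and function below are hypothetical, and deploying this requires the `modal` package and a Modal account:

```python
import modal

app = modal.App("snapshot-demo")

# enable_memory_snapshot asks Modal to checkpoint the container after imports
# and initialization finish, then restore that snapshot on later cold starts
# instead of re-running the setup from scratch.
@app.function(enable_memory_snapshot=True)
def predict(x: float) -> float:
    # Real code would run a pre-loaded model here; the snapshot means the
    # expensive load happens once, not on every cold start.
    return x * 2
```

This is most useful when initialization (importing heavy libraries, loading model weights) dominates startup time.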
What is included in the free Starter plan?
The Starter plan includes $30/month in compute credits, 100 containers, 10 GPU concurrency, and basic features. No upfront cost; pay per second for usage above credits. Team plan ($250/mo) adds unlimited seats, 1000 containers, 50 GPU concurrency, and higher limits.
Does Modal offer data residency or compliance?
Yes. Modal is SOC 2 certified and supports HIPAA compliance. Enterprise customers can select regions for data residency. The platform supports audit logs, Okta SSO, and custom security controls for regulated industries.
Can I deploy my own models on Modal?
Yes, Modal is fully customizable. Deploy any open-source model, proprietary model, or custom inference code. You define the container images, dependencies, and hardware (CPU/GPU types). Popular choices include Llama, Mistral, Flux, Stable Diffusion, and custom fine-tuned variants.