Fireworks AI: Fastest LLM Inference API 2026 (167 t/s)

Last updated: 2026-06-18

Fireworks AI is the fastest LLM inference API, delivering 167 t/s from $0.20/1M tokens with 99.8% uptime. SOC 2 Type II, HIPAA, GDPR. Trusted by Cursor.

Fireworks AI is the highest-throughput open LLM inference API, delivering 167 tokens per second on DeepSeek V4 Pro from $0.20 per million tokens. Founded in 2022 by Meta PyTorch engineers, it has 99.8% uptime and serves 10,000+ enterprise customers including Cursor, Perplexity, and Uber. SOC 2 Type II, HIPAA, GDPR, and ISO 27001 certified. New accounts get $1 in free credits.

About Fireworks AI

Fireworks AI is an enterprise inference platform for open-source LLMs, built in 2022 by seven engineers from Meta's PyTorch team. The company raised $327 million in total funding, reaching a $4 billion valuation with its October 2025 Series C co-led by Lightspeed Venture Partners and Index Ventures. With $800 million in annualized revenue as of May 2026 and 10,000+ enterprise customers, Fireworks has established itself as the leading independent inference infrastructure company outside the major cloud providers. The platform's speed advantage comes from FireAttention, a custom CUDA kernel stack optimized end-to-end for transformer inference. FireAttention V4 on B200 GPUs delivers 167 to 174 tokens per second on DeepSeek V4 Pro, which is 5 times faster than competing providers at the same price point. The inference engine handles disaggregated serving, semantic caching, and speculative decoding internally, resulting in 99.8% uptime and a P99 latency only 3.9x the median P50 rather than the typical 10x spikes seen elsewhere. The platform processes over 13 trillion tokens daily and sustains approximately 180,000 requests per second. Fireworks is best suited for engineering teams that need production-grade, low-latency inference on open models without managing GPU infrastructure. Named customers include Cursor (code completion), Perplexity (search AI), Notion (writing AI), Sourcegraph (code search), Uber, DoorDash, Shopify, and Upwork. The platform ships full fine-tuning covering supervised fine-tuning (SFT), direct preference optimization (DPO), and reinforcement fine-tuning (RFT), with up to 100 LoRA adapters deployed simultaneously at no extra cost per adapter. Pricing is strictly pay-per-token with no seat-based fees. Serverless inference starts at $0.20 per million tokens for 8B-class models and $0.90 per million tokens for 70B-class models. On-demand GPU deployments cost $2.90/hr for A100 80GB, $6.00/hr for H100/H200, and $9.00/hr for B200. New accounts receive $1 in starter credits. The API is OpenAI-compatible and now supports MCP (Model Context Protocol) through the Responses API in beta, enabling agentic tool-calling workflows with a single API call.

Pricing

Free: $1 starter credits, no subscription required. Serverless: $0.20/1M tokens (8B models), $0.90/1M tokens (70B models). Batch inference at 50% of serverless rates. On-demand GPU: $2.90/hr A100 80GB, $6.00/hr H100/H200, $9.00/hr B200. Volume, nonprofit, education, and startup discounts available.

Key Features

Pros

Cons

Frequently Asked Questions

What is Fireworks AI and what does it do?

Fireworks AI is an enterprise inference and fine-tuning platform for open-source large language models, founded in 2022 by seven engineers from Meta's PyTorch team. The platform gives developers access to 400+ open models including Llama 4, DeepSeek V4 Pro, Qwen 3, and Mixtral through an OpenAI-compatible API. Fireworks differentiates on speed: its proprietary FireAttention V4 CUDA kernel stack delivers 167 tokens per second on DeepSeek V4 Pro, which is 5 times faster than competing providers at the same price. The company has raised $327 million in total funding, reached a $4 billion valuation in October 2025, and serves 10,000+ enterprise customers including Cursor, Perplexity, Notion, Uber, DoorDash, and Shopify. It also ships full managed fine-tuning, on-demand GPU deployments, and an MCP-compatible Responses API for building agentic workflows. The platform processes over 13 trillion tokens per day with 99.8% uptime.

How much does Fireworks AI cost in 2026?

Fireworks AI uses pay-per-token pricing with no monthly subscriptions or seat fees. Serverless inference starts at $0.20 per million tokens for 8B-class models (such as Llama 3.1 8B) and $0.90 per million tokens for 70B-class models (such as Llama 3.3 70B or DeepSeek V4 Pro). Cached input tokens are billed at 50% of the standard rate by default. On-demand dedicated GPU deployments are priced at $2.90 per hour for an A100 80GB, $6.00 per hour for H100 or H200, and $9.00 per hour for B200. Batch inference is priced at 50% of serverless rates. New accounts receive $1 in free starter credits, enough to run thousands of inference calls on smaller models. Volume discounts, nonprofit pricing (40-80% off), education pricing (50-90% off), and startup program discounts are available on request. Enterprise customers with high monthly token volumes can negotiate annual contracts for 15-20% savings.

What are the main features of Fireworks AI?

The four core capabilities are high-speed inference, a broad open model catalog, managed fine-tuning, and enterprise compliance. On inference, FireAttention V4 delivers 167 tokens per second on DeepSeek V4 Pro, with disaggregated serving, semantic caching, and speculative decoding built into the engine. The model catalog covers 400+ models across text generation, vision, function calling, embedding, and image generation (FLUX and SDXL). Fine-tuning supports SFT, DPO, and reinforcement fine-tuning (RFT), with LoRA and full-parameter training, and up to 100 LoRA adapters deployable simultaneously at no extra cost per adapter. The Responses API (beta) enables agentic workflows with MCP tool integration, handling the full reasoning and tool-execution loop server-side. The API is fully OpenAI-compatible, allowing drop-in migration without code changes. Security certifications include SOC 2 Type II, HIPAA, GDPR, ISO 27001, ISO 27701, and ISO 42001.

Is Fireworks AI free to use?

Fireworks AI does not have a permanent free tier, but all new accounts receive $1 in free starter credits, which covers hundreds of inference calls on 8B models at $0.20 per million tokens. Once starter credits are used, the account switches to pay-per-token billing with no minimum monthly commitment. There is no free plan with recurring monthly credits. However, Fireworks AI offers significant discounts for nonprofits (40-80% off standard rates), educational institutions (50-90% off), and early-stage startups through an application-based startup program. Developers evaluating the platform can use the web playground on fireworks.ai to test models in the browser without an API key, though production use requires an account and API key. Enterprise teams requiring high-volume usage should contact sales for custom rate agreements.

What are the best alternatives to Fireworks AI?

The three closest alternatives are Groq, Together AI, and Replicate. Groq runs its own custom LPU (Language Processing Unit) silicon, which achieves 456 tokens per second on supported models, beating Fireworks on raw speed, but Groq's model catalog is limited and it does not support fine-tuning. Together AI has the broadest open model catalog, particularly for Qwen and MoE variants, and offers longer-standing batch pricing, but its inference speed and uptime lag behind Fireworks. Replicate is better for prototyping and image or video model access but is not designed for high-throughput enterprise LLM inference. For teams that prioritize the absolute lowest latency above all else, Groq is the better fit. For teams that need a balance of speed, model variety, fine-tuning, and enterprise compliance in one platform, Fireworks AI is the stronger choice.

Who is Fireworks AI best for?

Fireworks AI is best for ML engineers and AI platform teams building production applications on open-source LLMs who need enterprise-grade reliability without managing their own GPU infrastructure. Specific use cases where it excels include code completion (Cursor uses it for this), AI-powered search (Perplexity), multi-step agentic workflows via the MCP Responses API, and fine-tuned domain-specific models in regulated industries. Teams in healthcare, finance, or government that need HIPAA BAA agreements and SOC 2 Type II attestation will find Fireworks' compliance posture difficult to match among inference-only providers. It is not a good fit for teams whose primary workload is image or video generation, since the catalog is limited to roughly 5 image models and zero video models. Solo developers or small teams who want a no-code AI product rather than API-first infrastructure should also look elsewhere.

Does Fireworks AI have an API?

Yes, Fireworks AI is entirely API-first. The REST API is fully OpenAI-compatible, meaning any code written for the OpenAI SDK works with Fireworks by changing only the base URL to https://api.fireworks.ai/inference/v1 and substituting a Fireworks API key. The API covers chat completions, embeddings, image generation, and function calling. Fireworks also ships a Responses API in beta that natively supports MCP (Model Context Protocol), allowing agents to connect to external tools and data sources through a standardized interface with the full reasoning loop handled server-side. Integrations are available for Vercel AI SDK, LangChain, Langfuse, Promptfoo, Microsoft Azure AI Foundry, and CodeGPT. Fine-tuning is managed through a separate Training API that supports SFT, DPO, RFT, and custom training loops for advanced ML teams.

How does Fireworks AI compare to Groq in 2026?

Groq and Fireworks AI are the two fastest independent LLM inference providers in 2026, but they target different use cases. Groq's custom LPU hardware reaches 456 tokens per second on supported models, compared to Fireworks' 167 t/s on DeepSeek V4 Pro, giving Groq a raw speed advantage on its supported model set. However, Groq's model catalog is small (roughly 20-30 models) and does not include DeepSeek V4 Pro, Qwen 3, or FLUX, while Fireworks serves 400+ models. Groq does not offer fine-tuning at all; Fireworks covers SFT, DPO, RFT, and LoRA with up to 100 simultaneous adapters. On compliance, Fireworks holds SOC 2 Type II and HIPAA certifications, while Groq's compliance coverage is more limited for regulated industries. Fireworks is the better choice for teams that need model variety, fine-tuning, compliance, or agentic MCP workflows. Groq is the better choice for teams where raw inference speed on a small set of open models is the only decision criterion.

Visit Fireworks AI Official Website