5 min read·Updated May 15, 2026

Cerebras Inference

By Cerebras

Cerebras Inference is an AI inference platform powered by the Wafer-Scale Engine 3, the world's largest chip. It delivers the fastest token generation speeds for large language models and is backed by partnerships with OpenAI, AWS, and Meta. Cerebras completed the largest US tech IPO of 2026 on May 14, pricing at $185, opening at $385 (+108%), and closing at $311 (about +68%) for a $66 billion market cap.

Learning Objectives

  • Understand what Cerebras Inference is and how wafer-scale chip technology works
  • Compare Cerebras speed benchmarks to GPU-based and competing inference platforms
  • Evaluate Cerebras pricing tiers and the significance of its OpenAI and AWS partnerships

What Is Cerebras Inference?

Cerebras Inference is an AI inference platform built on the Wafer-Scale Engine 3 (WSE-3) — the largest chip ever made. While traditional GPUs use chips the size of a postage stamp, the WSE-3 is the size of an entire silicon wafer: 4 trillion transistors, 900,000 AI-optimized cores, and 44 gigabytes of on-chip SRAM memory.

The result is very high inference speed. In the benchmarks below, Cerebras generates tokens about 1.6 to 6 times faster than Groq's LPU on large models (70 billion+ parameters) and about 2.5 times faster than NVIDIA's flagship Blackwell GPU on Llama 4 Maverick.

💡Key Concept

Wafer-Scale Engine (WSE): Instead of cutting a silicon wafer into hundreds of individual chips, Cerebras uses the entire wafer as a single processor. This eliminates the communication bottlenecks between separate chips, allowing data to flow across 900,000 cores without leaving the chip.

Major Partnerships (2025-2026)

Cerebras has secured partnerships with three of the biggest names in AI:

  • OpenAI (January 2026): A multi-year deal to deploy 750 megawatts of Cerebras wafer-scale systems for OpenAI inference — described as the largest high-speed AI inference deployment in the world, rolling out 2026-2028
  • AWS (March 2026): WSE-3 chips coming to Amazon Bedrock, combining AWS Trainium for prefill with Cerebras CS-3 for decode. General availability expected in the second half of 2026
  • Meta (April 2025): Powers the Llama API with up to 18 times faster inference than GPU-based solutions
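
The AWS arrangement splits each request into two phases: prefill (processing the prompt) and decode (generating output tokens one at a time). The sketch below shows roughly how the two phases add up to response time. The function name and all throughput figures are hypothetical placeholders, not numbers from AWS or Cerebras documentation.

```python
def response_time_seconds(prompt_tokens: int, output_tokens: int,
                          prefill_tokens_per_sec: float,
                          decode_tokens_per_sec: float) -> float:
    """Estimate response time as prefill time plus decode time.

    Ignores network latency, queueing, and the hand-off between the
    prefill and decode hardware, all of which add time in practice.
    """
    prefill_time = prompt_tokens / prefill_tokens_per_sec
    decode_time = output_tokens / decode_tokens_per_sec
    return prefill_time + decode_time


# Hypothetical workload: a 4,000-token prompt and a 1,000-token answer.
# The prefill rate of 20,000 tokens/sec is an assumed placeholder.
slow_decode = response_time_seconds(4000, 1000, 20_000, 500)
fast_decode = response_time_seconds(4000, 1000, 20_000, 3000)

print(f"decode at 500 tokens/sec:   {slow_decode:.2f} s")
print(f"decode at 3,000 tokens/sec: {fast_decode:.2f} s")
```

Under these assumed numbers, decode accounts for most of the total time, which is why pairing a prefill-oriented chip with a decode-oriented one can make sense.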

Speed Benchmarks

Cerebras consistently leads inference speed benchmarks, especially on larger models:

Model | Cerebras Speed | Groq Speed | Speedup
GPT-OSS-120B | ~3,000 tokens/sec | ~493 tokens/sec | ~6x faster
Llama 3.1 8B | ~1,800 tokens/sec | ~1,345 tokens/sec | ~1.3x faster
Llama 3.1 70B | ~450 tokens/sec | ~275 tokens/sec | ~1.6x faster
Qwen3 480B Coder | ~2,000 tokens/sec | Not available | Largest model hosted
Llama 4 Maverick | ~2,500+ tokens/sec | Available | ~2.5x faster than NVIDIA flagship
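
The speedup column can be reproduced from the two speed columns. This short sketch does that for the three rows that list both a Cerebras and a Groq figure, and also shows what each rate means for a 1,000-token answer. The 1,000-token answer length is an illustrative assumption, and the speeds are the approximate values from the table.

```python
# Approximate tokens/sec from the benchmark table: (Cerebras, Groq).
benchmarks = {
    "GPT-OSS-120B": (3000, 493),
    "Llama 3.1 8B": (1800, 1345),
    "Llama 3.1 70B": (450, 275),
}

ANSWER_TOKENS = 1000  # illustrative answer length, not from the table


def speedup(cerebras_tps: float, groq_tps: float) -> float:
    """Ratio of Cerebras throughput to Groq throughput."""
    return cerebras_tps / groq_tps


for model, (cerebras_tps, groq_tps) in benchmarks.items():
    ratio = speedup(cerebras_tps, groq_tps)
    cerebras_secs = ANSWER_TOKENS / cerebras_tps
    groq_secs = ANSWER_TOKENS / groq_tps
    print(f"{model}: {ratio:.1f}x "
          f"({cerebras_secs:.2f} s vs {groq_secs:.2f} s per {ANSWER_TOKENS} tokens)")
```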

📝Note

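Cerebras's advantage grows with model size. On smaller models (8 billion parameters), the gap narrows. On frontier models (100 billion+), Cerebras pulls significantly ahead because the weights are served from on-chip SRAM, avoiding the off-chip memory bottleneck that slows down GPU-based systems. A 100-billion-parameter model at 16-bit precision needs about 200 GB for weights alone, more than the 44 GB on a single WSE-3, so models of this size have to be spread across multiple wafers rather than held on one chip.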
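
The note ties speed to on-chip SRAM, and each WSE-3 has 44 GB of it. As a back-of-the-envelope sketch, the code below estimates how much memory the weights alone need and the minimum number of WSE-3 chips whose combined SRAM could hold them. It assumes dense 16-bit weights and ignores the KV cache and activations. The function names are hypothetical, and this lesson does not describe how Cerebras actually partitions models.

```python
import math

WSE3_SRAM_GB = 44  # on-chip SRAM per WSE-3, as stated in this lesson


def weight_size_gb(params_billions: float, bytes_per_param: float = 2.0) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes).

    Assumes every parameter is stored at bytes_per_param (2.0 = 16-bit).
    """
    return params_billions * bytes_per_param


def min_wafers_for_weights(params_billions: float,
                           bytes_per_param: float = 2.0) -> int:
    """Lower bound on WSE-3 chips needed to keep all weights in SRAM."""
    return math.ceil(weight_size_gb(params_billions, bytes_per_param) / WSE3_SRAM_GB)


for size in (8, 70, 120, 480):
    print(f"{size}B parameters: {weight_size_gb(size):.0f} GB of weights, "
          f"at least {min_wafers_for_weights(size)} WSE-3 chip(s)")
```

Lower-precision weights would shrink these figures proportionally, for example by passing `bytes_per_param=1.0` for 8-bit weights.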

Supported Models

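As of March 2026, Cerebras hosts a focused selection of major open-source models. Those named elsewhere in this lesson include GPT-OSS-120B, Llama 3.1 (8B, 70B, and 405B), Qwen3 480B Coder, and Llama 4 Maverick.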

Pricing

Free: $0
  • 1 million tokens per day
  • 8,192 context length
Developer: Pay-per-token
  • Higher limits
  • Production use
Code Pro: $50/month
  • Coding-focused with discounted per-token rates
Code Max: $200/month
  • High-volume coding workloads
Enterprise: Custom
  • Dedicated capacity
  • Fine-tuned models
  • SLAs

Per-token pricing (approximate):

Model | Input (per 1 million tokens) | Output (per 1 million tokens)
Llama 3.1 8B | $0.10 | $0.10
Llama 3.1 70B | $0.60 | $0.60
Llama 3.1 405B | $6.00 | $12.00
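
To turn the per-token table into a bill, multiply each token count by its rate and divide by one million. Here is a minimal sketch using the approximate prices above. The dictionary keys are the display names from the table rather than API model identifiers, and the function name is hypothetical.

```python
# Approximate USD per 1 million tokens, from the table: (input, output).
PRICES_PER_MILLION = {
    "Llama 3.1 8B": (0.10, 0.10),
    "Llama 3.1 70B": (0.60, 0.60),
    "Llama 3.1 405B": (6.00, 12.00),
}


def estimate_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated pay-per-token cost in USD for one workload."""
    input_price, output_price = PRICES_PER_MILLION[model]
    total = input_tokens * input_price + output_tokens * output_price
    return total / 1_000_000


# Illustrative workload: 2 million input tokens, 500,000 output tokens.
for model in PRICES_PER_MILLION:
    cost = estimate_cost_usd(model, 2_000_000, 500_000)
    print(f"{model}: ${cost:.2f}")
```

Because the 405B model charges twice as much for output as for input, output-heavy workloads cost disproportionately more on it than on the smaller models.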

The free tier offering 1 million tokens per day is one of the most generous in the industry — enough for meaningful experimentation without a credit card.
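
A rough way to reason about the free tier is to check each request against its two limits. The sketch below assumes that the 8,192-token context length covers prompt plus output, and that the daily allowance counts both input and output tokens. Neither detail is confirmed in this lesson, and the helper name is hypothetical.

```python
FREE_TIER_DAILY_TOKENS = 1_000_000  # from the pricing list
FREE_TIER_CONTEXT_TOKENS = 8_192    # from the pricing list


def fits_free_tier(prompt_tokens: int, max_output_tokens: int,
                   tokens_used_today: int) -> bool:
    """Return True if a request stays within both free-tier limits.

    Assumption: prompt and output tokens both count toward the context
    length and toward the daily allowance.
    """
    request_tokens = prompt_tokens + max_output_tokens
    within_context = request_tokens <= FREE_TIER_CONTEXT_TOKENS
    within_daily = tokens_used_today + request_tokens <= FREE_TIER_DAILY_TOKENS
    return within_context and within_daily


# Number of requests per day if every request used the full context.
full_context_requests = FREE_TIER_DAILY_TOKENS // FREE_TIER_CONTEXT_TOKENS

print(fits_free_tier(6000, 2000, tokens_used_today=0))
print(fits_free_tier(7000, 2000, tokens_used_today=0))
print(full_context_requests)
```

Under these assumptions, the daily allowance covers a little over a hundred full-context requests, and many more if requests are short.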

WSE-3 vs. NVIDIA Blackwell

Spec | Cerebras WSE-3 | NVIDIA B200
Transistors | 4 trillion | 208 billion
Cores | 900,000 AI cores | 18,432 CUDA + 576 Tensor
On-chip memory | 44 GB SRAM | 192 GB HBM3e
AI compute | 125 petaFLOPS | ~4.5 petaFLOPS
Best for | Inference (speed leader) | Training + inference (flexibility)

⚠️Warning

Raw specs do not tell the full story. NVIDIA's ecosystem (CUDA, cuDNN, TensorRT) supports virtually any model and workload. Cerebras excels at inference speed but has a narrower model catalog and does not support custom fine-tuning through its API yet.

💡Key Concept

Cerebras's role in the three-way inference shift. Stratechery's Ben Thompson argued in his May 11, 2026 piece "The Inference Shift" that AI compute is splitting into three workload categories that need fundamentally different hardware:

  • Training: GPUs win on bandwidth and ecosystem
  • Answer inference: token speed for human-facing chat matters most
  • Agentic inference: humans aren't in the loop, so memory capacity and cost per token matter more than raw speed

Thompson positions Cerebras's WSE-3 as the canonical "answer inference" play, with 21 petabytes per second of on-chip SRAM bandwidth versus 3.35 terabytes per second of HBM bandwidth on the NVIDIA H100. When the next response in a conversation is what the user is waiting on, that bandwidth gap shows up as lower latency. For agentic workloads, where the model makes many tool calls without a human watching, the framework predicts that cost-optimized, memory-heavy hardware, possibly using slower, cheaper DRAM, will be more economical than either GPUs or Cerebras-class speed silicon.
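
One way to see why memory bandwidth maps to latency: when generating one token at a time for a single user, a dense model has to read roughly all of its weights for every token, so bandwidth divided by weight size gives a ceiling on tokens per second. The sketch below applies that simplification to the two bandwidth figures quoted above. The batch-size-1, dense, 16-bit setup is an assumption, the function name is hypothetical, and the results are theoretical ceilings rather than measured speeds.

```python
H100_HBM_BYTES_PER_SEC = 3.35e12  # 3.35 terabytes per second
WSE3_SRAM_BYTES_PER_SEC = 21e15   # 21 petabytes per second


def decode_ceiling_tokens_per_sec(bandwidth_bytes_per_sec: float,
                                  params_billions: float,
                                  bytes_per_param: float = 2.0) -> float:
    """Upper bound on single-stream decode speed.

    Assumes every weight is read once per generated token and that
    memory bandwidth is the only limit (no compute, KV cache, or
    interconnect costs).
    """
    weight_bytes = params_billions * 1e9 * bytes_per_param
    return bandwidth_bytes_per_sec / weight_bytes


# An 8-billion-parameter model at 16-bit precision (16 GB of weights).
h100_ceiling = decode_ceiling_tokens_per_sec(H100_HBM_BYTES_PER_SEC, 8)
wse3_ceiling = decode_ceiling_tokens_per_sec(WSE3_SRAM_BYTES_PER_SEC, 8)

print(f"H100 ceiling:  {h100_ceiling:,.0f} tokens/sec")
print(f"WSE-3 ceiling: {wse3_ceiling:,.0f} tokens/sec")
print(f"Bandwidth ratio: {WSE3_SRAM_BYTES_PER_SEC / H100_HBM_BYTES_PER_SEC:,.0f}x")
```

Real systems fall well short of these ceilings and GPUs recover throughput by batching many users together, so the ratio describes the gap in raw bandwidth, not the speedup a user would observe.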

Company Details

Detail | Info
Founded | 2016
CEO | Andrew Feldman
Headquarters | Sunnyvale, California
Employees | ~750-800
Latest Funding | $1 billion Series H (February 2026)
Market Cap | $66 billion (post-IPO close)
Total Raised | ~$2.9 billion private + $5.5 billion IPO proceeds
Key Investors | Tiger Global (lead); Benchmark; Fidelity; AMD; Coatue
IPO | Debuted May 14, 2026: 28 million shares priced at $185 (above the $115 to $160 range), $5.5 billion raised; stock opened at $385 (about 108% above the IPO price) and closed at $311 (about 68% above it) for a $66 billion market cap
Notable Customers | OpenAI; AWS; Meta; Group 42; MBZUAI (Abu Dhabi, UAE); Mistral; Perplexity; Mayo Clinic; US Department of Energy
Website | cerebras.ai

Strengths

  • Fastest inference on large models: roughly 1.6 to 6 times faster than Groq on 70 billion+ parameter models in the benchmarks above, and about 2.5 times faster than NVIDIA's flagship on Llama 4 Maverick
  • Wafer-scale architecture: 4 trillion transistors on a single chip eliminates inter-chip communication bottlenecks
  • Major partnerships: OpenAI (750 megawatt deployment), AWS (Bedrock integration), Meta (Llama API)
  • Generous free tier: 1 million tokens per day at no cost
  • Frontier model support: runs models up to 480 billion parameters (Qwen3 480B Coder)

Limitations and Considerations

  • Narrower model catalog: fewer models than Together AI or GPU cloud providers; focused on major open-source models
  • No custom fine-tuning via API: you cannot upload or fine-tune your own models (unlike Together AI or AWS SageMaker)
  • Inference only: Cerebras Cloud does not offer model training (though on-premises CS-3 systems support training)
  • AWS integration not yet live: Bedrock availability is announced for the second half of 2026 but is not generally available yet
  • Ecosystem maturity: NVIDIA's CUDA ecosystem is vastly more developed; Cerebras is still building out developer tooling

IPO and Public-Market Debut

Cerebras Systems completed the largest US tech IPO of 2026, pricing 28 million shares at $185, above the $115 to $160 range, and raising roughly $5.5 billion in proceeds. On its debut on Thursday, May 14, the stock opened at $385, more than double the IPO price, and closed at $311, about 68% above it, for a $66 billion market cap. The S-1 named OpenAI, Group 42, the Abu Dhabi-based Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), and Amazon Web Services as top customers, and Cerebras swung to profitability on $510 million of 2025 revenue. OpenAI is one of Cerebras's largest customers under a multi-year contract worth more than $10 billion. It also holds a $1 billion secured loan plus warrants for over 33 million shares, which makes OpenAI a meaningful post-listing shareholder.

The successful debut supports the thesis that frontier labs will pay a premium for NVIDIA alternatives when inference economics work, and it gives Cerebras a public-market currency to chase NVIDIA's data-center share more aggressively. It also positions Cerebras as the first AI-specialized silicon company to reach a frontier-scale public listing, and serves as a practical test of NVIDIA's GPU pricing power and frontier-lab compute lock-in. Ben Thompson's "Inference Shift" framework (see the Key Concept box above) helps explain the answer-inference bet behind investor demand: when a user is waiting on the next response in a conversation, the on-chip SRAM bandwidth gap between the WSE-3 and HBM-based GPUs shows up as a practical latency difference.

Key Takeaways

  • Cerebras Inference leads the inference speed benchmarks cited in this lesson, powered by the WSE-3, the world's largest chip with 4 trillion transistors
  • Speed advantage grows with model size: 6 times faster than Groq on 120 billion parameter models, with support for models up to 480 billion parameters
  • Major 2026 partnerships with OpenAI (750 megawatt deployment) and AWS (Bedrock integration) validate the technology at massive scale
  • Free tier offers 1 million tokens per day, which is ideal for experimentation
  • The company completed the largest US tech IPO of 2026 on May 14, pricing 28 million shares at $185 (above the $115 to $160 range) and raising $5.5 billion; the stock opened at $385, more than double the IPO price, and closed at $311 (about 68% above it) for a $66 billion market cap, leaving OpenAI a meaningful post-listing shareholder
