Learning Objectives
- Understand what Groq Cloud is and how its LPU architecture differs from GPUs
- Evaluate Groq's pricing, supported models, and use cases
- Assess the impact of NVIDIA's 2025 licensing deal on Groq's future
What Is Groq Cloud?
Groq Cloud (also called GroqCloud) is an AI inference platform that runs large language models on custom-designed Language Processing Units (LPUs) instead of traditional GPUs. The result: dramatically faster token generation — often 10 to 30 times faster than GPU-based alternatives for single-stream inference.
Founded in 2016 by Jonathan Ross (the architect behind Google's original TPU), Groq designed its chips from scratch to solve one problem: making AI inference as fast as possible. While GPUs excel at training models, Groq's LPU architecture is purpose-built for running them.
💡Key Concept
Language Processing Unit (LPU): Groq's custom chip uses SRAM (not HBM memory like GPUs) and deterministic execution — the compiler pre-computes the entire execution graph down to the clock cycle, eliminating the unpredictable memory bottlenecks that slow down GPU inference.
The NVIDIA Deal and the Inference-Cloud Pivot
In December 2025, NVIDIA entered a non-exclusive licensing agreement for Groq's inference technology, paying approximately $20 billion. The deal brought Groq founder Jonathan Ross and the bulk of senior chip-engineering leadership to NVIDIA, while GroqCloud was explicitly excluded and continues operating independently. At GTC 2026 in March, NVIDIA unveiled the Groq 3 LPU built on the licensed intellectual property — a clean separation of the chip lineage (now at NVIDIA) from the cloud service (still under Groq), with the Groq 3 expected to ship in late 2026.
The remaining Groq is now run as an inference-cloud company under interim CEO Adam Winter and interim CFO Matt Eng — both formerly senior Groq finance and operations leaders. The strategic refocus is sharp: Groq is no longer pitching as a frontier chip-design company chasing the next process node, but as an inference-cloud operator running the existing LPU fleet plus collecting royalties on the NVIDIA license.
The pitch is straightforward — inference is now a much larger market than training, and the customer-facing inference category (real-time voice, agentic browser control, low-latency function calling) is where Groq's LPU advantage on tokens-per-second economics remains intact after the chip lineage moved.
Groq is raising roughly $650 million to fund this pivot. Existing investors are leading the round, with Disruptive and Infinitium committed to fill any unsubscribed shares. Cumulative funding now exceeds $2 billion, building on the $750 million round closed in September 2025 at a $6.9 billion valuation.
⚠️Warning
The competitive question has shifted. Pre-NVIDIA-deal, the worry about building on GroqCloud was that the underlying chip lineage might lose its independent edge. Post-pivot, the question is whether Groq can hold the customer-facing inference category as NVIDIA (with the new Groq 3 lineage), AWS Bedrock, Cerebras, and Together AI all push their own inference services using the same or similar chip generations. The token-per-second economics still favor Groq for real-time and agentic workloads — but the moat is operational and customer-facing now, not silicon-only.
Speed Benchmarks
Groq's headline feature is raw inference speed:
| Model | Groq Speed (tokens/sec) | Typical GPU Speed | Speedup |
|---|---|---|---|
| Llama 3.1 8B | ~1,345 tok/sec | ~100-200 tok/sec | 7-13x faster |
| Qwen 3 32B | ~662 tok/sec | ~50-100 tok/sec | 7-13x faster |
| Llama 2 70B | ~300 tok/sec (single card) | ~30-60 tok/sec | 5-10x faster |
These speeds make Groq particularly compelling for real-time chat, agentic AI (where models make many sequential calls), and interactive applications where latency matters more than throughput.
Supported Models
As of March 2026, GroqCloud hosts a curated selection of open-source models:
| Tool | Best For |
|---|
The API is OpenAI-compatible — switching from OpenAI to Groq requires changing just the base URL and API key. No code rewrite needed.
Pricing
- Experimentation and prototyping
- Rate-limited
- Production applications with higher rate limits
- Dedicated capacity and SLAs
Per-token pricing (approximate):
| Model Size | Input (per 1 million tokens) | Output (per 1 million tokens) |
|---|---|---|
| Small (8-17 billion params) | ~$0.11 | ~$0.11 |
| Mid (32-70 billion params) | ~$0.59-$0.99 | ~$0.79 |
| Large (120 billion+ params) | ~$1.00 | Higher |
Cost-saving features: Batch processing saves 50% on input costs. Prompt caching gives 50% off cached input tokens.
Groq vs. Competitors
| Platform | Strength | Limitation |
|---|---|---|
| Groq Cloud | Fastest single-stream latency; purpose-built LPU silicon | No training; model selection limited to hosted open-source |
| Cerebras Inference | Higher throughput on frontier models (3,000+ tok/sec on 120 billion param models) | Smaller model catalog; less developer tooling |
| Together AI | More model variety; fine-tuning and custom training | Runs on rented GPUs; higher latency |
| AWS Bedrock | Broadest ecosystem; managed services; enterprise trust | Slower inference; higher cost per token |
Company Details
| Detail | Info |
|---|---|
| Founded | 2016 by Jonathan Ross (ex-Google TPU architect; moved to NVIDIA December 2025) |
| Interim CEO | Adam Winter (former senior Groq operations leader) |
| Interim CFO | Matt Eng |
| Headquarters | Mountain View, California |
| Current Round | Roughly $650 million (in progress, May 2026) |
| Last Closed Funding | $750 million (September 2025) at $6.9 billion valuation |
| Total Raised | Over $2 billion (cumulative) |
| Developers | Over 2 million signups; 360,000+ active monthly |
| Fortune 100 | 75% have GroqCloud accounts |
| Data Centers | 12+ across US, Canada, Middle East, and Europe |
| Website | groq.com |
Strengths
- Fastest inference for real-time applications — 10 to 30 times faster than GPU-based alternatives on single-stream queries
- Free tier with no credit card required — easy to experiment
- OpenAI-compatible API — switch from OpenAI with a 3-line code change
- Purpose-built silicon — LPU architecture designed specifically for inference, not repurposed training hardware
- Cost-competitive — batch processing and prompt caching reduce costs significantly
Limitations and Considerations
- Inference only — Groq cannot train or fine-tune models; it only runs pre-trained models
- Limited model selection — only hosts a curated set of open-source models (no proprietary models like GPT or Claude)
- NVIDIA deal uncertainty — with founder and IP at NVIDIA, GroqCloud's long-term independence is an open question
- Throughput vs. latency — Groq excels at single-stream speed but Cerebras outperforms on high-batch throughput workloads
- No custom models — you cannot upload or deploy your own fine-tuned models (unlike Together AI or AWS)
Key Takeaways
- Groq Cloud delivers the fastest single-stream AI inference available, powered by custom LPU chips purpose-built for running language models
- The free tier and OpenAI-compatible API make it easy to try — switch from OpenAI by changing just the base URL and API key
- NVIDIA's $20 billion December 2025 licensing deal brought Groq's founder and senior chip leadership to NVIDIA; GroqCloud was explicitly excluded and has refocused as an inference-cloud company under interim CEO Adam Winter and interim CFO Matt Eng
- A roughly $650 million funding round is in progress to fund the pivot, led by existing investors with Disruptive and Infinitium committed to fill any unsubscribed shares — taking cumulative funding past $2 billion
- The strategic bet is that inference is now a much larger market than training and that real-time / agentic workloads (where tokens-per-second economics matter most) remain Groq's strongest competitive ground
- Best suited for real-time chat, agentic AI workflows, voice and browser-control products, and latency-sensitive applications where speed matters more than throughput