Name: Groq Cloud
Availability: InStock
Author: Groq

Learning Objectives

Understand what Groq Cloud is and how its LPU architecture differs from GPUs
Evaluate Groq's pricing, supported models, and use cases
Assess the impact of NVIDIA's 2025 licensing deal on Groq's future

What Is Groq Cloud?

Groq Cloud (also called GroqCloud) is an AI inference platform that runs large language models on custom-designed Language Processing Units (LPUs) instead of traditional GPUs. The result: dramatically faster token generation — often 10 to 30 times faster than GPU-based alternatives for single-stream inference.

Founded in 2016 by Jonathan Ross (the architect behind Google's original TPU), Groq designed its chips from scratch to solve one problem: making AI inference as fast as possible. While GPUs excel at training models, Groq's LPU architecture is purpose-built for running them.

💡Key Concept

Language Processing Unit (LPU): Groq's custom chip uses SRAM (not HBM memory like GPUs) and deterministic execution — the compiler pre-computes the entire execution graph down to the clock cycle, eliminating the unpredictable memory bottlenecks that slow down GPU inference.

The NVIDIA Deal and the Inference-Cloud Pivot

In December 2025, NVIDIA entered a non-exclusive licensing agreement for Groq's inference technology, paying approximately $20 billion. The deal brought Groq founder Jonathan Ross and the bulk of senior chip-engineering leadership to NVIDIA, while GroqCloud was explicitly excluded and continues operating independently. At GTC 2026 in March, NVIDIA unveiled the Groq 3 LPU built on the licensed intellectual property — a clean separation of the chip lineage (now at NVIDIA) from the cloud service (still under Groq), with the Groq 3 expected to ship in late 2026.

The remaining Groq is now run as an inference-cloud company under interim CEO Adam Winter and interim CFO Matt Eng — both formerly senior Groq finance and operations leaders. The strategic refocus is sharp: Groq is no longer pitching as a frontier chip-design company chasing the next process node, but as an inference-cloud operator running the existing LPU fleet plus collecting royalties on the NVIDIA license.

The pitch is straightforward — inference is now a much larger market than training, and the customer-facing inference category (real-time voice, agentic browser control, low-latency function calling) is where Groq's LPU advantage on tokens-per-second economics remains intact after the chip lineage moved.

Groq is raising roughly $650 million to fund this pivot. Existing investors are leading the round, with Disruptive and Infinitium committed to fill any unsubscribed shares. Cumulative funding now exceeds $2 billion, building on the $750 million round closed in September 2025 at a $6.9 billion valuation.

⚠️Warning

The competitive question has shifted. Pre-NVIDIA-deal, the worry about building on GroqCloud was that the underlying chip lineage might lose its independent edge. Post-pivot, the question is whether Groq can hold the customer-facing inference category as NVIDIA (with the new Groq 3 lineage), AWS Bedrock, Cerebras, and Together AI all push their own inference services using the same or similar chip generations. The token-per-second economics still favor Groq for real-time and agentic workloads — but the moat is operational and customer-facing now, not silicon-only.

Speed Benchmarks

Groq's headline feature is raw inference speed:

Model	Groq Speed (tokens/sec)	Typical GPU Speed	Speedup
Llama 3.1 8B	~1,345 tok/sec	~100-200 tok/sec	7-13x faster
Qwen 3 32B	~662 tok/sec	~50-100 tok/sec	7-13x faster
Llama 2 70B	~300 tok/sec (single card)	~30-60 tok/sec	5-10x faster

These speeds make Groq particularly compelling for real-time chat, agentic AI (where models make many sequential calls), and interactive applications where latency matters more than throughput.

Supported Models

As of March 2026, GroqCloud hosts a curated selection of open-source models:

Tool	Best For
Llama 3.3 70B	Main workhorse model for general-purpose tasks
GPT-OSS 120B	Reasoning flagship; newest addition
Qwen 3 32B	Efficient multilingual model; replaced Mistral Saba
DeepSeek R1 Distill 70B	Reasoning-optimized distilled model
Llama 4 Scout 17B	Latest Llama generation; mixture-of-experts
Llama 3.1 8B Instant	Fastest and cheapest option for simple tasks

The API is OpenAI-compatible — switching from OpenAI to Groq requires changing just the base URL and API key. No code rewrite needed.

Pricing

Plan	Price	Features
Free	$0 (no credit card)	Experimentation and prototyping Rate-limited
Developer	Pay-as-you-go	Production applications with higher rate limits
Enterprise	Custom pricing	Dedicated capacity and SLAs

Free$0 (no credit card)

Experimentation and prototyping
Rate-limited

DeveloperPay-as-you-go

Production applications with higher rate limits

EnterpriseCustom pricing

Dedicated capacity and SLAs

Per-token pricing (approximate):

Model Size	Input (per 1 million tokens)	Output (per 1 million tokens)
Small (8-17 billion params)	~$0.11	~$0.11
Mid (32-70 billion params)	~$0.59-$0.99	~$0.79
Large (120 billion+ params)	~$1.00	Higher

Cost-saving features: Batch processing saves 50% on input costs. Prompt caching gives 50% off cached input tokens.

Groq vs. Competitors

Platform	Strength	Limitation
Groq Cloud	Fastest single-stream latency; purpose-built LPU silicon	No training; model selection limited to hosted open-source
Cerebras Inference	Higher throughput on frontier models (3,000+ tok/sec on 120 billion param models)	Smaller model catalog; less developer tooling
Together AI	More model variety; fine-tuning and custom training	Runs on rented GPUs; higher latency
AWS Bedrock	Broadest ecosystem; managed services; enterprise trust	Slower inference; higher cost per token

Company Details

Detail	Info
Founded	2016 by Jonathan Ross (ex-Google TPU architect; moved to NVIDIA December 2025)
Interim CEO	Adam Winter (former senior Groq operations leader)
Interim CFO	Matt Eng
Headquarters	Mountain View, California
Current Round	Roughly $650 million (in progress, May 2026)
Last Closed Funding	$750 million (September 2025) at $6.9 billion valuation
Total Raised	Over $2 billion (cumulative)
Developers	Over 2 million signups; 360,000+ active monthly
Fortune 100	75% have GroqCloud accounts
Data Centers	12+ across US, Canada, Middle East, and Europe
Website	groq.com

Strengths

Fastest inference for real-time applications — 10 to 30 times faster than GPU-based alternatives on single-stream queries
Free tier with no credit card required — easy to experiment
OpenAI-compatible API — switch from OpenAI with a 3-line code change
Purpose-built silicon — LPU architecture designed specifically for inference, not repurposed training hardware
Cost-competitive — batch processing and prompt caching reduce costs significantly

Limitations and Considerations

Inference only — Groq cannot train or fine-tune models; it only runs pre-trained models
Limited model selection — only hosts a curated set of open-source models (no proprietary models like GPT or Claude)
NVIDIA deal uncertainty — with founder and IP at NVIDIA, GroqCloud's long-term independence is an open question
Throughput vs. latency — Groq excels at single-stream speed but Cerebras outperforms on high-batch throughput workloads
No custom models — you cannot upload or deploy your own fine-tuned models (unlike Together AI or AWS)

Key Takeaways

Groq Cloud delivers the fastest single-stream AI inference available, powered by custom LPU chips purpose-built for running language models
The free tier and OpenAI-compatible API make it easy to try — switch from OpenAI by changing just the base URL and API key
NVIDIA's $20 billion December 2025 licensing deal brought Groq's founder and senior chip leadership to NVIDIA; GroqCloud was explicitly excluded and has refocused as an inference-cloud company under interim CEO Adam Winter and interim CFO Matt Eng
A roughly $650 million funding round is in progress to fund the pivot, led by existing investors with Disruptive and Infinitium committed to fill any unsubscribed shares — taking cumulative funding past $2 billion
The strategic bet is that inference is now a much larger market than training and that real-time / agentic workloads (where tokens-per-second economics matter most) remain Groq's strongest competitive ground
Best suited for real-time chat, agentic AI workflows, voice and browser-control products, and latency-sensitive applications where speed matters more than throughput

Groq Cloud

Audio & video lessons are paid features