Free to read. Sign up to save your progress and take knowledge-check quizzes.

Sign up free
10 min read·Updated March 24, 2026

AI Hardware & Chips

The AI chip landscape spans NVIDIA's dominant GPU ecosystem, AMD's memory-rich challengers, Apple's unified silicon for local AI, and a growing array of custom ASICs from Google, AWS, and Cerebras — each with distinct tradeoffs in performance, cost, and ecosystem.

Listen to this lesson

Free preview · first 0:30
0:00 / 0:30

Audio & video lessons are paid features

Plus unlocks audio streaming. Pro adds downloadable audio, video, certificates, and more.

Plus adds:
  • Audio streaming
  • Downloadable PDFs
  • All AI Playbooks
  • Personalized content
Pro also adds:
  • Certificates of completion
  • Audio MP3 downloads
  • Video lessonssoon
  • & More…soon

Watch this lesson

Video coming soon

Learning Objectives

  • Identify the major AI chip families and their primary use cases
  • Explain NVIDIA's software ecosystem advantage beyond raw hardware performance
  • Compare custom ASICs (TPUs, Trainium, Cerebras) with general-purpose GPUs for specific workloads

Why Hardware Matters for Developers

For developers calling APIs, the hardware is abstracted — you never see the GPU your request runs on. But hardware literacy matters for several reasons:

  • Local model inference: Running Llama or DeepSeek locally requires choosing compatible hardware
  • Cost reasoning: Understanding why inference costs what it does helps make informed build-vs-buy decisions
  • Cloud instance selection: Choosing AWS GPU instances (P4d, P5) vs. Trainium vs. Inferentia requires understanding the tradeoffs
  • Future planning: Hardware trajectories determine what AI capabilities will be affordable in 12-24 months
ChipMakerMemoryBest Use Case
H100 SXMNVIDIA80GB HBM3Training and inference; maximum CUDA compatibility
H200 SXMNVIDIA141GB HBM3eLarge model inference; drop-in H100 upgrade with faster memory
GB200 NVL72NVIDIA720GB total (72 GPUs unified)Ultra-large model training; inference at datacenters
GB300 Blackwell UltraNVIDIA576GB HBM3e (per superchip)Next-gen training and inference; enhanced FP4/FP6; bridge to Vera Rubin
AMD MI300XAMD192GB HBM3Memory-intensive inference; very large models; cost competitive
Apple M4 MaxApple128GB unified (shared CPU/GPU)Local model inference on macOS; privacy-first development
Google TPU Trillium v6GoogleVariesJAX-based model training; Gemini derivatives; cost efficient on GCP
Cerebras WSE-3Cerebras44GB on-chip SRAMUltra-fast inference; 1,000+ tokens/sec; unusual architecture
AWS Trainium3AmazonVariesTraining cost optimization on AWS; 40% cheaper than GPU for supported models

NVIDIA — The Dominant Ecosystem

NVIDIA holds approximately 80% of AI training and inference compute market share. Understanding why reveals as much about software as hardware.

The Hardware: H100 to Blackwell Ultra

H100 SXM (Hopper architecture, 2022-2024): The workhorse GPU of the current AI era.

  • 80GB HBM3 memory; 3.35 petaflops FP16 performance
  • NVLink interconnect for multi-GPU communication
  • Market price: $25,000-$35,000 per GPU; cloud instance price: $2-3/hour

H200 SXM (2024-2025): Evolutionary improvement over H100.

  • 141GB HBM3e — 1.75x the H100's memory, with faster bandwidth
  • Same Hopper GPU die as H100; the upgrade is primarily the memory
  • Drop-in compatibility with H100 infrastructure

GB200 Blackwell NVL72 (2025): NVIDIA's rack-scale flagship.

  • 72 Grace CPUs + 72 Blackwell GPUs interconnected as a single logical unit
  • 720GB of total HBM3e memory shared across the system
  • Designed for inference of very large models (405 billion+ parameter models) and next-generation training runs
  • Not a single server — a rack-scale unit with liquid cooling

GB300 Blackwell Ultra (announced GTC 2025): Enhanced Blackwell variant.

  • Pairs two upgraded Blackwell Ultra GPUs with one Grace CPU via NVLink-C2C
  • 576GB HBM3e per superchip (up from 384GB on GB200)
  • Enhanced FP4 and FP6 performance for both training and inference
  • Bridge product between Blackwell and the next-generation Vera Rubin architecture

The Roadmap: Vera Rubin (Shipping H2 2026)

Vera Rubin NVL72 is now in production and shipping H2 2026 — NVIDIA's next-generation architecture following Blackwell:

  • HBM4 memory — 2.8x faster memory bandwidth than Blackwell
  • 10x inference cost reduction vs. Blackwell for equivalent workloads
  • Pairs with next-gen Grace CPU (Vera CPU)
  • Part of NVIDIA's annual architecture cadence: Hopper (2022) → Blackwell (2024) → Vera Rubin (2026)
  • Rubin Ultra variant on the roadmap for 2027

This annual cadence is a key competitive advantage — each generation delivers roughly 2x the performance of the prior, keeping NVIDIA ahead of competitors who update on longer cycles.

Consumer GPUs: RTX 5090

The RTX 5090 (released January 2025, $1,999 MSRP) is NVIDIA's flagship consumer GPU:

  • 32GB GDDR7 memory — up from the RTX 4090's 24GB GDDR6X
  • Significant performance improvement for local model inference
  • Best consumer option for running quantized 30 billion-70 billion parameter models locally

NVIDIA DGX Spark is now shipping at $4,699 (raised from the original $3,999 announcement price) — a desktop AI workstation with 128GB unified memory and 1 PFLOPS of AI performance.

The Software Moat: CUDA and NIM

CUDA is NVIDIA's parallel computing platform — a programming model and compiler toolchain for GPU programming. Every major AI framework (PyTorch, TensorFlow, JAX) is deeply optimized for CUDA. Every major AI model is trained, profiled, and optimized on CUDA hardware.

This ecosystem has been built over 20 years. The gap isn't just current performance — it's documentation, community answers, optimized kernels, and production code that all assumes CUDA. Switching to AMD or other hardware requires adapting or rewriting significant portions of this stack.

NVIDIA NIM (Neural Inference Microservices) simplifies deployment: pre-optimized Docker containers that package LLMs with TensorRT-LLM optimization. Developers can deploy optimized Llama, Mistral, and other models with a single docker pull. Free for development; enterprise license for production.

TensorRT-LLM is the open-source engine behind NIM's performance — automatic batching, quantization (FP8, INT4), KV cache optimization, and multi-GPU scheduling for production inference.

The CUDA moat is why NVIDIA's pricing power persists even as the hardware becomes more expensive. Customers are buying into the full ecosystem — hardware, software, and developer tools. See the full NVIDIA company profile for more on their product lineup.

AMD — The Memory-Rich Challenger

AMD's MI300X has found real traction specifically for large model inference:

192GB of HBM3 per GPU — 2.4x the memory of the H100's 80GB. For running very large models (Llama 70 billion+, Mixtral 8x7 billion), being able to fit more of the model in GPU memory reduces costly memory bandwidth bottlenecks.

Large cloud providers (Azure, Meta's own infrastructure) have deployed MI300X at scale for inference workloads where the memory advantage justifies the ROCm software migration cost.

ROCm and HIP: AMD's answer to CUDA. HIP (Heterogeneous-computing Interface for Portability) allows porting CUDA code to ROCm with minimal changes for common operations. Major frameworks support ROCm, but the ecosystem depth still trails CUDA.

MI350 (shipped Q3 2025): 288GB HBM3e — AMD's next-generation, competing directly with the H200. Performance parity on many benchmarks; ROCm has improved to v7.0-7.2 but still trails CUDA in ecosystem depth. MI400 has been previewed as the next generation.

Apple Silicon — Local AI on Mac

Apple's M-series chips have evolved to the M5 generation (released October 2025), featuring GPU Neural Accelerators with 4x AI performance improvement over M4. The M4 and M5 families (Pro, Max, Ultra variants) have made MacBook Pro, Mac Studio, and Mac Pro compelling platforms for local AI inference:

Unified memory architecture: CPU and GPU share the same memory pool — up to 128GB on M4 Max and 256GB on M4 Ultra. No memory copying between CPU and GPU (a bottleneck on discrete GPU systems). For models that fit in unified memory, this is highly efficient.

MLX: Apple's open-source machine learning framework optimized for M-series. Model inference on MLX on M4 Max is often faster than equivalent CPU inference and competitive with consumer NVIDIA GPUs. The mlx-community on Hugging Face hosts hundreds of pre-converted model weights optimized for MLX.

Practical local models on M4 Max (128GB):

  • Llama 4 Scout: fits comfortably; full-quality inference
  • DeepSeek R1 7 billion/14 billion: excellent performance
  • Mistral 7 billion: fast, practical for daily use
  • Gemma 3 27 billion: fits in 128GB; high quality

Apple Intelligence is Apple's privacy-first AI system built into iOS 18+, iPadOS, and macOS — featuring Writing Tools, Image Playground, Genmoji, and enhanced Siri. Most processing runs on-device via the Neural Engine; complex tasks use Private Cloud Compute, Apple's confidential cloud infrastructure running on Apple Silicon servers.

For developers who need a powerful portable workstation for local AI development, M4 Max MacBook Pro is currently the best consumer option. See the full Apple company profile for more on their AI tools.

Custom ASICs: The Hyperscaler Strategy

The largest cloud providers are building their own AI chips to reduce NVIDIA dependence for internal workloads:

Google TPUs (Ironwood v7)

TPUs are optimized for Google's JAX framework and the model architectures Google uses internally. The latest Ironwood v7 generation (GA 2026) achieves 42.5 Exaflops at 9,216 chips — a massive leap from Trillium v6. Anthropic committed to over 1 million Ironwood chips for Claude training and inference.

Available on Google Cloud; best for teams using JAX-based training or deploying Google's models on GCP.

Cerebras WSE-3 — The Unusual Architecture

Cerebras WSE-3 is architecturally distinct: instead of multiple chips connected via NVLink, WSE-3 is a single wafer-scale chip — the entire silicon wafer is one chip.

  • 900,000 AI cores on a single die
  • 44GB of on-chip SRAM — memory directly on the chip, not HBM stacked alongside it
  • Result: extremely high memory bandwidth (on-chip SRAM is much faster than HBM)
  • 2,100 tokens per second inference (8x faster than H200) — this is what powers GPT-5.3-Codex-Spark's real-time performance

Cerebras is available as a cloud service and is targeting an IPO in Q2 2026 at an estimated ~$22 billion valuation. The tradeoff: limited software compatibility (not general-purpose CUDA), but for applications where inference speed is the primary requirement, WSE-3 is in a different performance class.

AWS Trainium3 and Inferentia2

AWS's custom chips are optimized for the AWS ecosystem:

  • Trainium3 (EC2 Trn2): training chips; ~40% cost reduction vs. equivalent NVIDIA for supported models (Llama, BERT variants)
  • Inferentia2 (EC2 Inf2): inference chips; ~60% cost reduction for supported models; up to 2.3 petaflops

Neither Trainium nor Inferentia supports the full CUDA ecosystem — they use AWS's own Neuron compiler. For teams on AWS running high-volume inference on common models, the cost savings justify the adaptation. For custom architectures or research workloads, NVIDIA remains simpler.

Microsoft Maia 200

Microsoft's Maia 200 is used internally for Azure's AI inference workloads — specifically Copilot and Azure OpenAI Service responses. It reduces Microsoft's NVIDIA dependence for predictable, high-volume inference. Not a cloud product customers can purchase directly.

Memory Technology Deep Dive

The memory attached to AI chips is as important as the compute:

HBM (High Bandwidth Memory): The standard for AI chips. Multiple DRAM dies stacked vertically and bonded directly to the chip package. Current standard: HBM3e (H200, MI300X, MI350). HBM4 entered production in February 2026 (SK Hynix first, Samsung Q2 2026), enabling the Vera Rubin architecture. Only three suppliers: SK Hynix, Samsung, Micron.

GDDR7: Consumer GPU memory (gaming cards). Lower bandwidth than HBM, dramatically cheaper. Used in consumer GPUs that also run local AI models (RTX 4090: 24GB GDDR6X; RTX 5090: GDDR7).

On-chip SRAM: The fastest memory, directly on the chip die. Limited by chip area. Used for KV cache and active computation — the bottleneck for inference speed in many models.

CXL (Compute Express Link): Emerging PCIe-based standard for memory pooling and disaggregation — enables connecting additional memory to servers beyond what's physically on the GPU. Important for next-generation AI systems that need to exceed single-server memory limits.

Key Takeaways

  • NVIDIA dominates AI compute (~80% market share) with hardware performance and the CUDA software ecosystem — 20 years of ecosystem depth is harder to compete with than chip specs
  • AMD MI300X is the strongest NVIDIA alternative, particularly for memory-intensive large model inference (192GB vs. H100's 80GB); ROCm ecosystem is improving but still trails CUDA
  • Apple M5 (October 2025) delivers 4x AI performance over M4 with GPU Neural Accelerators; M4/M5 Max with 128GB unified memory enables running Llama 70 billion locally
  • Custom ASICs are maturing fast: Google TPU Ironwood v7 (42.5 Exaflops), Cerebras WSE-3 (2,100 tok/s, IPO Q2 2026), AWS Trainium3 GA (3nm) — all optimize for specific workloads at lower cost
  • HBM4 entered production (February 2026), enabling the next generation of GPU architectures (NVIDIA Vera Rubin, shipping H2 2026)

Save your progress & take the quiz

Sign up free to bookmark lessons, track which modules you've completed, and lock in what you learned with a quick knowledge-check quiz at the end of each lesson.

🧭Recommended for you