9.10 — AI Hardware & Chips

Learning Objectives

Identify the major AI chip families and their primary use cases
Explain NVIDIA's software ecosystem advantage beyond raw hardware performance
Compare custom ASICs (TPUs, Trainium, Cerebras) with general-purpose GPUs for specific workloads

Why Hardware Matters for Developers

For developers calling APIs, the hardware is abstracted — you never see the GPU your request runs on. But hardware literacy matters for several reasons:

Local model inference: Running Llama or DeepSeek locally requires choosing compatible hardware
Cost reasoning: Understanding why inference costs what it does helps make informed build-vs-buy decisions
Cloud instance selection: Choosing AWS GPU instances (P4d, P5) vs. Trainium vs. Inferentia requires understanding the tradeoffs
Future planning: Hardware trajectories determine what AI capabilities will be affordable in 12-24 months

Chip	Maker	Memory	Best Use Case
H100 SXM	NVIDIA	80GB HBM3	Training and inference; maximum CUDA compatibility
H200 SXM	NVIDIA	141GB HBM3e	Large model inference; drop-in H100 upgrade with faster memory
GB200 NVL72	NVIDIA	~13.4TB HBM3e total (72 GPUs unified)	Ultra-large model training; datacenter-scale inference
GB300 Blackwell Ultra	NVIDIA	576GB HBM3e (per superchip)	Next-gen training and inference; enhanced FP4/FP6; bridge to Vera Rubin
AMD MI300X	AMD	192GB HBM3	Memory-intensive inference; very large models; cost competitive
Apple M4 Max	Apple	128GB unified (shared CPU/GPU)	Local model inference on macOS; privacy-first development
Google TPU Trillium v6	Google	Varies	JAX-based model training; Gemini derivatives; cost efficient on GCP
Cerebras WSE-3	Cerebras	44GB on-chip SRAM	Ultra-fast inference; 1,000+ tokens/sec; unusual architecture
AWS Trainium3	Amazon	Varies	Training cost optimization on AWS; 40% cheaper than GPU for supported models
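
One way to read the table is to ask which single devices can hold a model's weights at all. The sketch below is a rough estimate under stated assumptions: weights dominate memory use, KV cache and activations are ignored, and the whole capacity is usable (on Apple Silicon part of unified memory is reserved for the system). The function names and the chosen subset of chips are illustrative.

```python
# Capacities in GB, copied from the table above. Only single-device rows with
# a fixed capacity are included; rack-scale and "Varies" rows are left out.
CHIP_MEMORY_GB = {
    "H100 SXM": 80,
    "H200 SXM": 141,
    "AMD MI300X": 192,
    "Apple M4 Max": 128,
}

def weights_gb(params_billions: float, bytes_per_param: float = 2.0) -> float:
    """Approximate weight footprint in GB (FP16 = 2 bytes per parameter)."""
    return params_billions * bytes_per_param

def chips_that_fit(params_billions: float, bytes_per_param: float = 2.0) -> list:
    """Names of chips whose memory is at least the weight footprint."""
    need = weights_gb(params_billions, bytes_per_param)
    return sorted(name for name, gb in CHIP_MEMORY_GB.items() if gb >= need)

print(chips_that_fit(70))        # 70 billion parameters at FP16 (140 GB)
print(chips_that_fit(70, 0.5))   # same model at 4 bits per weight (35 GB)
```

Real deployments need headroom beyond the weights, so treat a borderline fit as a miss.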

NVIDIA — The Dominant Ecosystem

NVIDIA holds an estimated 80% of the market for AI training and inference compute. The reasons have as much to do with software as with hardware.

The Hardware: H100 to Blackwell Ultra

H100 SXM (Hopper architecture, 2022-2024): The workhorse GPU of the current AI era.

80GB HBM3 memory with 3.35 TB/s of bandwidth; roughly 1 petaflop of dense FP16 Tensor Core performance
NVLink interconnect for multi-GPU communication
Market price: $25,000-$35,000 per GPU; cloud instance price: $2-3/hour
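
The purchase and rental prices above imply a break-even point for build-vs-buy reasoning. This is a minimal sketch using only the two price ranges quoted in the list; it deliberately ignores power, cooling, hosting, networking, and resale value, all of which push the real break-even later. The function name is illustrative.

```python
HOURS_PER_YEAR = 8760

def break_even_hours(purchase_price_usd: float, cloud_rate_usd_per_hour: float) -> float:
    """GPU-hours of rental that cost as much as buying the card outright."""
    return purchase_price_usd / cloud_rate_usd_per_hour

# Best case for buying: cheapest card against the priciest cloud rate.
low = break_even_hours(25_000, 3.0)
# Worst case for buying: priciest card against the cheapest cloud rate.
high = break_even_hours(35_000, 2.0)

print(f"Break-even: {low:,.0f} to {high:,.0f} GPU-hours")
print(f"At 100% utilization: {low / HOURS_PER_YEAR:.1f} to {high / HOURS_PER_YEAR:.1f} years")
```

At lower utilization the break-even stretches proportionally, which is why bursty workloads usually favor renting.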

H200 SXM (2024-2025): Evolutionary improvement over H100.

141GB HBM3e — 1.75x the H100's memory, with faster bandwidth
Same Hopper GPU die as H100; the upgrade is primarily the memory
Drop-in compatibility with H100 infrastructure

GB200 Blackwell NVL72 (2025): NVIDIA's rack-scale flagship.

36 Grace CPUs + 72 Blackwell GPUs interconnected as a single logical unit
Roughly 13.4TB of total HBM3e memory across the system
Designed for inference of very large models (405 billion+ parameter models) and next-generation training runs
Not a single server — a rack-scale unit with liquid cooling

GB300 Blackwell Ultra (announced GTC 2025): Enhanced Blackwell variant.

Pairs two upgraded Blackwell Ultra GPUs with one Grace CPU via NVLink-C2C
576GB HBM3e per superchip (up from 384GB on GB200)
Enhanced FP4 and FP6 performance for both training and inference
Bridge product between Blackwell and the next-generation Vera Rubin architecture

The Roadmap: Vera Rubin (Shipping H2 2026)

Vera Rubin NVL72, NVIDIA's next-generation architecture following Blackwell, is in production, with shipments planned for H2 2026:

HBM4 memory, with a claimed 2.8x the memory bandwidth of Blackwell
A claimed 10x inference cost reduction vs. Blackwell for equivalent workloads
Pairs with next-gen Grace CPU (Vera CPU)
Part of NVIDIA's annual architecture cadence: Hopper (2022) → Blackwell (2024) → Blackwell Ultra (2025) → Vera Rubin (2026)
Rubin Ultra variant on the roadmap for 2027

This cadence is a key competitive advantage: NVIDIA positions each generation at roughly 2x the performance of the prior one, which keeps it ahead of competitors that update on longer cycles.

Consumer GPUs: RTX 5090

The RTX 5090 (released January 2025, $1,999 MSRP) is NVIDIA's flagship consumer GPU:

32GB GDDR7 memory — up from the RTX 4090's 24GB GDDR6X
Significant performance improvement for local model inference
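Best consumer option for running quantized models locally: 30 billion parameter models fit in VRAM at 4-bit to 8-bit precision, while 70 billion parameter models need roughly 3-bit quantization or partial CPU offload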
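
Whether a quantized model fits in the RTX 5090's 32GB comes down to simple arithmetic on bits per weight. The sketch below counts weights only, an assumption that understates real usage because the KV cache and runtime buffers also need VRAM. The helper names are illustrative.

```python
VRAM_GB = 32  # RTX 5090 capacity from the list above

def quantized_weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Weight footprint in GB: parameters x bits per weight / 8 bits per byte."""
    return params_billions * bits_per_weight / 8

def fits(params_billions: float, bits_per_weight: float, vram_gb: float = VRAM_GB) -> bool:
    """True when the weights alone are no larger than the available VRAM."""
    return quantized_weights_gb(params_billions, bits_per_weight) <= vram_gb

for params in (30, 70):
    for bits in (8, 4, 3):
        size = quantized_weights_gb(params, bits)
        label = "fits" if fits(params, bits) else "too large"
        print(f"{params}B at {bits}-bit: {size:.2f} GB ({label})")
```

Models that come out too large can still run with some layers offloaded to system RAM, at a significant speed cost.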

NVIDIA DGX Spark is now shipping at $4,699 (raised from the original $3,999 announcement price) — a desktop AI workstation with 128GB unified memory and 1 PFLOPS of AI performance.

The Software Moat: CUDA and NIM

CUDA is NVIDIA's parallel computing platform: a programming model and compiler toolchain for GPU programming. Every major AI framework (PyTorch, TensorFlow, JAX) is deeply optimized for CUDA. Most major AI models are trained, profiled, and optimized on CUDA hardware, with Google's TPU-trained models the main exception.

This ecosystem has been built over 20 years. The gap isn't just current performance — it's documentation, community answers, optimized kernels, and production code that all assumes CUDA. Switching to AMD or other hardware requires adapting or rewriting significant portions of this stack.

NVIDIA NIM (NVIDIA Inference Microservices) simplifies deployment: pre-optimized Docker containers that package LLMs with TensorRT-LLM optimization. Developers can deploy optimized Llama, Mistral, and other models with a single docker pull. Free for development; enterprise license for production.

TensorRT-LLM is the open-source engine behind NIM's performance — automatic batching, quantization (FP8, INT4), KV cache optimization, and multi-GPU scheduling for production inference.
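
To see why KV cache optimization and FP8 quantization matter for serving, it helps to size the cache. The sketch below uses the standard per-sequence formula (keys and values, per layer, per KV head). The model shape is an assumption chosen to resemble a 70 billion parameter model with grouped-query attention; it is not taken from the text, and the function name is illustrative.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, batch: int = 1, bytes_per_elem: int = 2) -> int:
    """Bytes needed to cache keys and values for every layer and token."""
    # The leading 2 accounts for storing both a key and a value tensor.
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Assumed shape: 80 layers, 8 KV heads, head dimension 128.
fp16 = kv_cache_bytes(80, 8, 128, seq_len=8192)
fp8 = kv_cache_bytes(80, 8, 128, seq_len=8192, bytes_per_elem=1)

print(f"FP16 cache, one 8,192-token sequence: {fp16 / 2**30:.2f} GiB")
print(f"FP8 cache, one 8,192-token sequence:  {fp8 / 2**30:.2f} GiB")
print(f"FP16 cache, batch of 32 sequences:    {32 * fp16 / 2**30:.0f} GiB")
```

Because the cache grows linearly with both batch size and context length, halving its element size roughly doubles how many concurrent requests fit on one GPU.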

The CUDA moat is why NVIDIA's pricing power persists even as the hardware becomes more expensive. Customers are buying into the full ecosystem — hardware, software, and developer tools. See the full NVIDIA company profile for more on their product lineup.

AMD — The Memory-Rich Challenger

AMD's MI300X has found real traction specifically for large model inference:

192GB of HBM3 per GPU, 2.4x the H100's 80GB. For very large models (Llama 70 billion+, Mixtral 8x7B), fitting the whole model on fewer GPUs reduces the need to shard it, which cuts costly inter-GPU communication.
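
The memory advantage can be made concrete by counting how many GPUs are needed just to hold a model's weights. This is a minimal sketch: FP16 weights at 2 bytes per parameter, and a `usable_fraction` of 0.9 that reserves headroom for KV cache and buffers. That fraction and the function name are assumptions for illustration.

```python
import math

def gpus_needed(params_billions: float, gpu_memory_gb: float,
                bytes_per_param: float = 2.0, usable_fraction: float = 0.9) -> int:
    """Minimum GPU count whose combined usable memory holds the weights."""
    weights_gb = params_billions * bytes_per_param
    return math.ceil(weights_gb / (gpu_memory_gb * usable_fraction))

for params in (70, 405):
    h100 = gpus_needed(params, 80)
    mi300x = gpus_needed(params, 192)
    print(f"{params}B at FP16: {h100} x H100 vs. {mi300x} x MI300X")
```

Fewer GPUs per model replica means fewer tensor-parallel boundaries, so less time is spent moving activations between devices.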

Large cloud providers (Azure, Meta's own infrastructure) have deployed MI300X at scale for inference workloads where the memory advantage justifies the ROCm software migration cost.

ROCm and HIP: AMD's answer to CUDA. HIP (Heterogeneous-computing Interface for Portability) allows porting CUDA code to ROCm with minimal changes for common operations. Major frameworks support ROCm, but the ecosystem depth still trails CUDA.

MI350 (shipped Q3 2025): AMD's next generation, with 288GB HBM3e, positioned against NVIDIA's H200 and Blackwell-generation GPUs. AMD reports performance parity on many benchmarks; ROCm has improved to v7.0-7.2 but still trails CUDA in ecosystem depth. The MI400 generation launched in July 2026 as AMD's answer to NVIDIA's Vera Rubin, deployed in the new Helios rack-scale system (72 MI455X GPUs per rack), with Microsoft, Anthropic, OpenAI, and Meta among the launch customers.

Apple Silicon — Local AI on Mac

Apple's M-series chips have reached the M5 generation (released October 2025), which adds GPU Neural Accelerators and, by Apple's figures, delivers about 4x the AI performance of M4. The M4 and M5 families, along with the M3 Ultra in Mac Studio, have made MacBook Pro and Mac Studio compelling platforms for local AI inference:

Unified memory architecture: CPU and GPU share the same memory pool, up to 128GB on M4 Max and up to 512GB on the M3 Ultra in Mac Studio. There is no memory copying between CPU and GPU, which is a bottleneck on discrete GPU systems. For models that fit in unified memory, this is highly efficient.

MLX: Apple's open-source machine learning framework optimized for M-series chips. MLX inference on an M4 Max is typically much faster than CPU-only inference and competitive with consumer NVIDIA GPUs. The mlx-community organization on Hugging Face hosts hundreds of pre-converted model weights for MLX.

Practical local models on M4 Max (128GB):

Llama 4 Scout: fits when quantized to about 4 bits per weight (roughly 109 billion total parameters); full-precision weights exceed 128GB
DeepSeek R1 distilled 7 billion/14 billion: excellent performance
Mistral 7 billion: fast, practical for daily use
Gemma 3 27 billion: fits in 128GB; high quality

Apple Intelligence is Apple's privacy-first AI system built into iOS 18+, iPadOS, and macOS — featuring Writing Tools, Image Playground, Genmoji, and enhanced Siri. Most processing runs on-device via the Neural Engine; complex tasks use Private Cloud Compute, Apple's confidential cloud infrastructure running on Apple Silicon servers.

For developers who need a powerful portable workstation for local AI development, M4 Max MacBook Pro is currently the best consumer option. See the full Apple company profile for more on their AI tools.

Custom ASICs: The Hyperscaler Strategy

The largest cloud providers are building their own AI chips to reduce NVIDIA dependence for internal workloads:

Google TPUs (Ironwood v7)

TPUs are optimized for Google's JAX framework and the model architectures Google uses internally. The latest Ironwood v7 generation (GA 2026) reaches 42.5 exaflops of peak compute in a 9,216-chip pod, a large step up from Trillium v6. Anthropic has announced plans to use up to 1 million TPUs for Claude training and inference.

Available on Google Cloud; best for teams using JAX-based training or deploying Google's models on GCP.
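
Pod-level figures are easier to compare with single GPUs once divided out per chip. The short calculation below uses only the two Ironwood numbers quoted above; note that it is a peak figure, the numeric precision behind it is not stated here, and it should not be compared directly with FP16 numbers for other chips.

```python
POD_EXAFLOPS = 42.5   # peak pod compute quoted above
POD_CHIPS = 9_216     # chips per pod quoted above

# 1 exaflop = 1,000 petaflops
per_chip_petaflops = POD_EXAFLOPS * 1_000 / POD_CHIPS

print(f"Peak compute per Ironwood chip: {per_chip_petaflops:.2f} PFLOPS")
```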

Cerebras WSE-3 — The Unusual Architecture

Cerebras WSE-3 is architecturally distinct: instead of multiple chips connected via NVLink, WSE-3 is a single wafer-scale chip — the entire silicon wafer is one chip.

900,000 AI cores on a single die
44GB of on-chip SRAM — memory directly on the chip, not HBM stacked alongside it
Result: extremely high memory bandwidth (on-chip SRAM is much faster than HBM)
Reported inference speeds of around 2,100 tokens per second on some models, roughly 8x faster than H200-based serving; this speed is what powers GPT-5.3-Codex-Spark's real-time performance

Cerebras is available as a cloud service and is targeting an IPO in Q2 2026 at an estimated ~$22 billion valuation. The tradeoff: limited software compatibility (not general-purpose CUDA), but for applications where inference speed is the primary requirement, WSE-3 is in a different performance class.

AWS Trainium3 and Inferentia2

AWS's custom chips are optimized for the AWS ecosystem:

Trainium3 (EC2 Trn3): training chips; ~40% cost reduction vs. equivalent NVIDIA GPUs for supported models (Llama, BERT variants)
Inferentia2 (EC2 Inf2): inference chips; ~60% cost reduction for supported models; up to 2.3 petaflops per instance

Neither Trainium nor Inferentia supports the full CUDA ecosystem — they use AWS's own Neuron compiler. For teams on AWS running high-volume inference on common models, the cost savings justify the adaptation. For custom architectures or research workloads, NVIDIA remains simpler.
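
Whether the savings justify the adaptation is a payback calculation. The sketch below takes the ~40% and ~60% reductions quoted above as given; the monthly GPU spend and the one-time migration cost are hypothetical inputs chosen only to make the arithmetic visible, and the function name is illustrative.

```python
def months_to_break_even(monthly_gpu_cost_usd: float, savings_fraction: float,
                         migration_cost_usd: float) -> float:
    """Months until cumulative savings cover the one-time porting effort."""
    monthly_savings = monthly_gpu_cost_usd * savings_fraction
    return migration_cost_usd / monthly_savings

MONTHLY_GPU_SPEND = 50_000   # hypothetical current NVIDIA instance bill
MIGRATION_COST = 120_000     # hypothetical engineering cost of porting to Neuron

trainium = months_to_break_even(MONTHLY_GPU_SPEND, 0.40, MIGRATION_COST)
inferentia = months_to_break_even(MONTHLY_GPU_SPEND, 0.60, MIGRATION_COST)

print(f"Trainium (~40% cheaper): {trainium:.1f} months to break even")
print(f"Inferentia (~60% cheaper): {inferentia:.1f} months to break even")
```

Low-volume or fast-changing workloads may never reach payback, which matches the point that NVIDIA remains simpler for research.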

Microsoft Maia 200

Microsoft's Maia 200 is used internally for Azure's AI inference workloads — specifically Copilot and Azure OpenAI Service responses. It reduces Microsoft's NVIDIA dependence for predictable, high-volume inference. Not a cloud product customers can purchase directly.

Memory Technology Deep Dive

The memory attached to AI chips is as important as the compute:

HBM (High Bandwidth Memory): The standard for AI chips. Multiple DRAM dies are stacked vertically and placed in the same package as the processor. Current generations: HBM3 (H100, MI300X) and HBM3e (H200, MI350). HBM4 entered production in February 2026 (SK Hynix first, Samsung Q2 2026), enabling the Vera Rubin architecture. Only three suppliers exist: SK Hynix, Samsung, and Micron.

GDDR7: Consumer GPU memory (gaming cards). Lower bandwidth than HBM but dramatically cheaper. Used in consumer GPUs that also run local AI models (RTX 5090: 32GB GDDR7; the older RTX 4090 uses 24GB GDDR6X).

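On-chip SRAM: The fastest memory, located directly on the chip die and limited by chip area. It holds the data being actively computed on. On GPUs, weights and KV cache live in HBM and must be streamed into SRAM, and that transfer is the bottleneck for inference speed in many models.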

CXL (Compute Express Link): Emerging PCIe-based standard for memory pooling and disaggregation — enables connecting additional memory to servers beyond what's physically on the GPU. Important for next-generation AI systems that need to exceed single-server memory limits.
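
Memory bandwidth sets a ceiling on single-stream decoding speed, because generating each token requires reading the model's weights once. The sketch below estimates that ceiling for a dense model at batch size 1. The bandwidth figures (3.35 TB/s for H100 SXM, 4.8 TB/s for H200) are vendor peak specifications and are an assumption not stated in this section; real throughput is lower, and batching raises aggregate throughput well above this per-stream bound.

```python
def max_decode_tokens_per_sec(bandwidth_tb_s: float, params_billions: float,
                              bytes_per_param: float) -> float:
    """Upper bound on tokens/sec when every token reads all weights once."""
    weight_bytes = params_billions * 1e9 * bytes_per_param
    return bandwidth_tb_s * 1e12 / weight_bytes

# Assumed peak memory bandwidth in TB/s (vendor specifications).
PEAK_BANDWIDTH_TB_S = {"H100 SXM": 3.35, "H200 SXM": 4.8}

for chip, bandwidth in PEAK_BANDWIDTH_TB_S.items():
    # 70 billion parameters stored at 1 byte each (8-bit weights).
    ceiling = max_decode_tokens_per_sec(bandwidth, 70, 1.0)
    print(f"{chip}: at most {ceiling:.1f} tokens/sec per stream")
```

The same formula explains the appeal of on-chip SRAM designs: raising bandwidth raises the ceiling directly, without changing the model.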

Key Takeaways

NVIDIA dominates AI compute (~80% market share) with hardware performance and the CUDA software ecosystem — 20 years of ecosystem depth is harder to compete with than chip specs
AMD MI300X is the strongest NVIDIA alternative, particularly for memory-intensive large model inference (192GB vs. H100's 80GB); ROCm ecosystem is improving but still trails CUDA
Apple M5 (October 2025) delivers 4x AI performance over M4 with GPU Neural Accelerators; an M4 Max with 128GB unified memory can run quantized 70 billion parameter models locally
Custom ASICs are maturing fast: Google TPU Ironwood v7 (42.5 Exaflops), Cerebras WSE-3 (2,100 tok/s, IPO Q2 2026), AWS Trainium3 GA (3nm) — all optimize for specific workloads at lower cost
HBM4 entered production (February 2026), enabling the next generation of GPU architectures (NVIDIA Vera Rubin, shipping H2 2026)
