9.11 — Edge AI | AI Pro Playbook

Learning Objectives

Explain the five primary motivations for edge AI deployment and when each is the deciding factor
Identify the leading models and frameworks for local inference
Evaluate which hardware options support local model inference at the consumer and professional levels

The Cloud vs. Edge Tradeoff

Every AI application faces a fundamental architectural choice: where does inference happen?

Cloud inference: Send a request to a remote API and receive a response. It is simpler to build, gives access to frontier models, scales automatically, and requires no local hardware. The tradeoffs: you pay per query, you need internet connectivity, and your data leaves the device.

Edge inference: The model runs locally, whether on a laptop, a server in your building, or an on-device chip in a phone. It is more complex to set up and limited to models that fit the available hardware, but it keeps data on the device, adds no network latency on top of local compute, has near-zero marginal cost per query after the hardware purchase (electricity aside), and works offline.

The right architecture depends on your requirements. Cloud inference is a sensible default; edge becomes the better choice when specific constraints make cloud inference unsuitable.

Five Reasons to Go to the Edge

1. Privacy

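When data never leaves the device, it is never exposed to a third party. The model processes your input locally and produces output locally, so nothing is transmitted to an external service. The device itself still needs ordinary security, but the transmission and third-party storage risks disappear.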

This matters for:

Healthcare applications processing patient information
Legal work with privileged documents
Financial analysis with confidential data
Personal assistants that should never send your private conversations to external servers
Enterprise deployments in industries with strict data handling regulations

2. Latency

Cloud round-trips add latency: 50-500ms per request depending on geographic distance and server load. For real-time applications, this is prohibitive:

Voice assistants where conversational delay is jarring
Autonomous vehicles making split-second decisions
AR/VR applications where AI response must be imperceptible
Industrial control systems with hard real-time requirements

On capable hardware, edge inference for small models can respond in under 10ms, which is roughly 5-50x lower than the 50-500ms cloud round-trip cited above.
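Whether figures like these hold for a particular workload is best checked by measurement. The following is a minimal, standard-library-only sketch for timing repeated calls; `measure_latency_ms` and its parameters are illustrative names, and the `call` argument stands in for whatever local or cloud inference call is being compared.

```python
import statistics
import time
from typing import Callable, Dict


def measure_latency_ms(call: Callable[[], object], runs: int = 20, warmup: int = 2) -> Dict[str, float]:
    """Time `call` repeatedly and summarize wall-clock latency in milliseconds."""
    if runs < 1:
        raise ValueError("runs must be at least 1")

    # Warm-up calls run but are not timed: first calls often pay one-off
    # costs such as model loading or connection setup.
    for _ in range(warmup):
        call()

    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        samples.append((time.perf_counter() - start) * 1000.0)

    samples.sort()
    # Nearest-rank style p95: a rough tail estimate, not an interpolated percentile.
    p95_index = min(len(samples) - 1, int(0.95 * len(samples)))
    return {
        "median_ms": statistics.median(samples),
        "p95_ms": samples[p95_index],
        "max_ms": samples[-1],
    }


if __name__ == "__main__":
    # Stand-in workload; replace the lambda with a real inference call.
    print(measure_latency_ms(lambda: sum(range(10_000))))
```

Comparing the median and the tail for a local call against the same numbers for a cloud call gives a more useful picture than a single average, since real-time applications are usually limited by the slowest responses.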

3. Cost at Scale

Cloud inference costs accumulate: at $0.003 per thousand tokens (Haiku 4.5 pricing), a million daily queries of just 100 tokens each cost about $300 per day, or roughly $9,000 per month, and longer queries scale that figure proportionally. For applications with sufficient query volume, local hardware can pay for itself in months.

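Consumer hardware example: an RTX 4090 (~$2,000) running a quantized model that fits in its 24GB of VRAM eliminates ongoing API costs for that workload. (Llama 4 Scout, discussed below, is too large for 24GB even at 4-bit quantization, so it would need CPU offloading or larger hardware.) Break-even calculation: if you'd otherwise spend $500/month on inference, the hardware pays back in 4 months, before electricity and maintenance.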
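The cost arithmetic in this section can be written down as a short sketch. The function names are illustrative, the 100-tokens-per-query figure is an assumption rather than a measured value, and `monthly_running_cost` is a hypothetical placeholder for electricity and maintenance.

```python
def monthly_api_cost(queries_per_day: float, tokens_per_query: float,
                     price_per_1k_tokens: float, days_per_month: int = 30) -> float:
    """Estimated monthly API bill in dollars for a steady query volume."""
    tokens_per_month = queries_per_day * tokens_per_query * days_per_month
    return tokens_per_month / 1000.0 * price_per_1k_tokens


def breakeven_months(hardware_cost: float, monthly_api_spend: float,
                     monthly_running_cost: float = 0.0) -> float:
    """Months until local hardware has paid for itself.

    Returns infinity when running the hardware costs at least as much per
    month as the API it replaces, because it never pays back.
    """
    monthly_savings = monthly_api_spend - monthly_running_cost
    if monthly_savings <= 0:
        return float("inf")
    return hardware_cost / monthly_savings


if __name__ == "__main__":
    # One million queries per day, assuming 100 tokens each, at $0.003 per 1K tokens.
    print(round(monthly_api_cost(1_000_000, 100, 0.003), 2))
    # The $2,000 GPU against a $500/month API bill.
    print(breakeven_months(2000, 500))  # 4.0
```

Adding a nonzero `monthly_running_cost` lengthens the payback period, which is why the 4-month figure is best read as a lower bound.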

4. Reliability

Edge AI works offline; cloud AI requires internet connectivity. This matters for:

Aircraft and ships in transit
Remote industrial equipment
Medical devices in facilities with limited connectivity
Disaster response scenarios
Any application where connectivity is intermittent

Edge inference provides the reliability guarantee that cloud cannot.

5. Data Sovereignty

Some countries require data to remain within their geographic boundaries. Some industries have regulations specifying where data can be processed. On-premise deployment satisfies requirements that cloud APIs — which may route through data centers in other jurisdictions — cannot guarantee.
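The five reasons above amount to a simple rule: stay with cloud inference unless at least one constraint forces the edge. A minimal sketch of that rule follows. The `Requirements` fields, the 50ms round-trip threshold (the low end of the range quoted in the latency section), and the 12-month payback limit are all illustrative assumptions, not recommendations.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Requirements:
    """Hypothetical summary of one application's constraints."""
    data_must_stay_on_device: bool = False      # 1. privacy
    latency_budget_ms: Optional[float] = None   # 2. latency (None = no hard budget)
    monthly_api_spend: float = 0.0              # 3. cost at scale
    hardware_cost: float = 0.0
    must_work_offline: bool = False             # 4. reliability
    data_residency_required: bool = False       # 5. data sovereignty


# Illustrative thresholds (assumptions).
CLOUD_ROUND_TRIP_MS = 50.0   # best-case cloud round-trip
MAX_PAYBACK_MONTHS = 12.0    # longest acceptable hardware payback period


def choose_deployment(req: Requirements) -> Tuple[str, List[str]]:
    """Return ("edge", reasons) if any constraint rules out cloud, else ("cloud", [])."""
    reasons: List[str] = []
    if req.data_must_stay_on_device:
        reasons.append("privacy")
    if req.latency_budget_ms is not None and req.latency_budget_ms < CLOUD_ROUND_TRIP_MS:
        reasons.append("latency")
    if req.monthly_api_spend > 0 and req.hardware_cost > 0:
        if req.hardware_cost / req.monthly_api_spend <= MAX_PAYBACK_MONTHS:
            reasons.append("cost at scale")
    if req.must_work_offline:
        reasons.append("reliability")
    if req.data_residency_required:
        reasons.append("data sovereignty")
    return ("edge", reasons) if reasons else ("cloud", reasons)


if __name__ == "__main__":
    print(choose_deployment(Requirements()))
    print(choose_deployment(Requirements(must_work_offline=True, latency_budget_ms=20)))
```

Returning the list of reasons alongside the decision keeps the deciding factor visible, which is the point of treating the five motivations separately.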

Edge LLMs (Local Inference Models)

Not every model is designed for edge deployment. The best edge models are small, efficient, and specifically optimized for local inference:

Phi-4 Family (Microsoft, MIT License)

Microsoft's Phi-4 series has expanded significantly — now a full family of edge-optimized models:

Phi-4 Mini (3.8 billion parameters): Optimized for mobile devices; fits in 8GB of GPU memory
Phi-4 Multimodal (5.6 billion parameters): Handles text, images, and audio on-device
Phi-4 Reasoning and Phi-4 Reasoning Plus: Specialized for chain-of-thought reasoning at small scale
Phi-4 Reasoning Vision (15 billion parameters, March 2026): Vision + reasoning capabilities in a single model
All MIT license: free for commercial deployment
Works with ONNX Runtime and DirectML for Windows NPU acceleration

Apple MLX

MLX (v0.31.1) is Apple's machine learning framework for M-series chips. It's not a model — it's a framework that enables efficient inference of popular models on Apple Silicon. The latest version adds M5 Neural Accelerator support, leveraging the 4x AI performance improvement in the M5 generation. Models converted to MLX format often run faster on M4/M5 than in llama.cpp or other frameworks on the same hardware.

Llama, Gemma, Mistral, and Phi models are all available in MLX-converted form on Hugging Face.

Llama 4 Scout (Meta, Open License)

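Llama 4 Scout (17 billion active parameters / 16 experts MoE, 10 million token context) is Meta's edge-friendly model, designed to be deployable on a single high-end GPU (Meta cites an H100 with Int4 quantization). Because it is a mixture-of-experts model, only 17 billion parameters are active per token, but all 16 experts, about 109 billion parameters in total, must still fit in memory. The Scout variant (vs. Maverick with 128 experts) prioritizes deployability over maximum capability. Both are natively multimodal (text + images).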

Strong performance on most knowledge and reasoning tasks; the active open-source community has produced optimized inference configurations for multiple hardware targets.

Gemma 3 Family (Google, Apache 2.0)

Google's small model family has expanded significantly (1 billion-27 billion parameters) with new edge-specific variants:

Gemma 3n: Optimized specifically for mobile devices — smallest form factor
Gemma 3 270 million: Ultra-compact variant for extremely constrained environments
FunctionGemma: Specialized for agent tool-calling — enables on-device AI agents
128K context window — unusually large for a small model
Strong multilingual support
Available in quantized versions for memory-constrained deployments

Gemma 3's combination of multilingual depth, 128K context, and purpose-built mobile variants (3n, 270 million) makes it one of the most versatile edge model families available.

Ollama — The Easiest Path to Local Models

Ollama is not a model — it's a tool that makes running local models as simple as a single terminal command:

# Install Ollama (macOS/Linux)
curl -fsSL https://ollama.com/install.sh | sh

# Download and run Llama 4 Scout
ollama run llama4:scout

# Run DeepSeek R1
ollama run deepseek-r1:7b

# Run Gemma 3
ollama run gemma3:27b

Ollama (v0.18.2) handles model downloading, quantization selection, and CPU/GPU configuration. Recent additions include web search (agents can search the web), cloud model support (connect to remote APIs), and Windows ARM64 native builds. It also exposes a simple REST API on localhost:

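import requests

# "stream": False makes Ollama return one JSON object instead of a stream of chunks
response = requests.post("http://localhost:11434/api/generate",
    json={"model": "llama4:scout", "prompt": "Explain gradient descent", "stream": False})
print(response.json()["response"])  # the generated text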

For any developer experimenting with local models, Ollama is the right starting point. It works on Mac (M-series), Linux, and Windows.
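By default the `/api/generate` endpoint streams its reply as newline-delimited JSON, one chunk per line. The sketch below shows one way to reassemble such a stream and derive a tokens-per-second figure from the final chunk. It assumes the field names `response`, `done`, `eval_count`, and `eval_duration` (in nanoseconds) as described in Ollama's API documentation; verify them against the version you run. The helper names and the sample lines are made up for illustration.

```python
import json
from typing import Iterable, Optional, Tuple, Union


def collect_stream(lines: Iterable[Union[str, bytes]]) -> Tuple[str, dict]:
    """Join the text chunks of a streamed reply; return (text, final_chunk)."""
    parts = []
    final: dict = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip keep-alive blank lines
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            final = chunk  # the last chunk carries the timing statistics
            break
    return "".join(parts), final


def tokens_per_second(final: dict) -> Optional[float]:
    """Generation speed from the final chunk, or None if the stats are missing."""
    count = final.get("eval_count")
    duration_ns = final.get("eval_duration")
    if not count or not duration_ns:
        return None
    return count * 1e9 / duration_ns


if __name__ == "__main__":
    # Made-up sample stream, shaped like the documented response format.
    sample = [
        '{"response": "Gradient ", "done": false}',
        '{"response": "descent", "done": false}',
        '{"response": "", "done": true, "eval_count": 2, "eval_duration": 50000000}',
    ]
    text, final = collect_stream(sample)
    print(text)
    print(tokens_per_second(final))
```

With the `requests` library, the lines could come from `requests.post(..., stream=True).iter_lines()`, which yields bytes that `json.loads` accepts directly.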

✅Tip

Start with Ollama. Install it in 2 minutes, run ollama run llama4:scout, and have a local LLM running as soon as the model download finishes. No API key, no cloud account, no cost. It's the fastest way to understand what local inference feels like and whether it meets your latency requirements.

Edge AI Hardware

NVIDIA DGX Spark GB10

NVIDIA's DGX Spark is a desktop-form-factor AI workstation, now shipping at $4,699 (raised from the $3,999 announcement price):

128GB of unified memory
1 PFLOPS of AI performance
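Runs models of up to roughly 200 billion parameters locally, per NVIDIA; frontier-class models such as Llama 4 Maverick (400 billion total parameters) and DeepSeek V3 (671 billion) need more than 128GB even at 4-bit quantization, so they require linked units or more aggressive compression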

Targeted at AI researchers and professionals who need powerful local inference for privacy-sensitive or cost-sensitive workloads without building a full server infrastructure.

Apple MacBook Pro M4 Max

The M4 Max with 128GB unified memory is currently among the strongest consumer options for local AI:

Llama 4 Scout: 40+ tokens/second
DeepSeek R1 14 billion: fast inference
70 billion parameter models: feasible, slower
MLX framework optimization for Apple Silicon

For professionals who need a portable workstation that handles both local AI inference and regular development work, M4 Max MacBook Pro is the practical choice.

Windows AI PCs (Copilot+ PCs)

Qualcomm, Intel, and AMD all supply processors with dedicated Neural Processing Units (NPUs) for Copilot+ PCs. Microsoft is now shifting priority from NPU to GPU/CPU for AI workloads, as discrete GPUs deliver more capable inference:

NPUs still handle small model inference (Phi-4 Mini, Whisper) efficiently
Recall (controversial screen recording feature) is now fully opt-in
Intel Panther Lake processors target 50 TOPS NPU performance
More limited than discrete GPU setups for large models — the NPU is supplemental, not primary

The NPU category continues to mature — practical for always-on features (transcription, local assistant) but not for frontier model inference.

Consumer NVIDIA GPUs

The RTX 4090 (24GB GDDR6X) and RTX 5090 (32GB GDDR7) are the best consumer NVIDIA options for local inference:

More VRAM than most workstation GPUs at the price
Fast enough for real-time inference on most 7 billion-13 billion models
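Can run quantized versions of larger models: 30 billion-class models fit at 4-bit, while a 70 billion model at 4-bit needs about 35GB for weights alone and so relies on CPU offloading or lower-bit quantization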

For developers who already have a gaming-capable GPU, local model inference is often possible without additional hardware investment.
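A quick way to judge whether a model fits a given GPU is to estimate the memory its weights need: parameters times bits per weight, divided by 8 to get bytes. The sketch below applies that rule of thumb. The function names are illustrative, and the 20% `overhead_fraction` is an assumed allowance for the KV cache and activations, which in practice varies with context length and runtime.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    # params_billion * 1e9 parameters * (bits / 8) bytes each, divided by 1e9.
    return params_billion * bits_per_weight / 8.0


def fits_in_memory(params_billion: float, bits_per_weight: float,
                   memory_gb: float, overhead_fraction: float = 0.2) -> bool:
    """True if the weights plus an assumed runtime overhead fit in memory_gb."""
    needed = weight_memory_gb(params_billion, bits_per_weight) * (1.0 + overhead_fraction)
    return needed <= memory_gb


if __name__ == "__main__":
    for params in (7, 13, 30, 70):
        size = weight_memory_gb(params, 4)
        print(f"{params}B at 4-bit: {size:.1f} GB weights, "
              f"fits 24GB: {fits_in_memory(params, 4, 24)}, "
              f"fits 32GB: {fits_in_memory(params, 4, 32)}")
```

For mixture-of-experts models, `params_billion` should be the total parameter count, not the active count, since every expert has to be resident in memory.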

Key Takeaways

Edge AI is the right choice when privacy, offline reliability, latency, cost at scale, or data sovereignty requirements make cloud-only inference unsuitable — not as a default
The leading edge LLMs have expanded: Phi-4 family (variants ranging from the 3.8 billion Mini to the 15 billion Reasoning Vision); Llama 4 Scout (17 billion active MoE, 10 million context); Gemma 3 family (including 3n mobile, 270 million compact, FunctionGemma for tool-calling); DeepSeek R1 for open-source reasoning
Ollama is the fastest path to experimenting with local models: one installation command, then ollama run [model-name] to download and run any supported model
Hardware options range from M4 Max MacBook Pro (128GB unified, excellent consumer option) to NVIDIA DGX Spark (purpose-built desktop AI workstation) to consumer RTX cards for development use
