Learning Objectives
- Understand Fly.io's edge-GPU positioning and how it differs from centralized AI clouds
- Identify the available GPU types and their workload fit
- Evaluate when edge-GPU inference makes sense vs. centralized AI clouds
What Are Fly.io GPU Machines?
Fly.io is a developer-focused cloud platform for running containerized applications close to users — its core product is Fly Machines, persistent VMs that can be deployed to any of 35 global regions in seconds. Fly GPU Machines extend this to AI workloads by adding NVIDIA GPUs to the Fly Machine fleet.
The strategic positioning: rather than competing with hyperscalers for centralized GPU training, Fly.io focuses on AI inference at the edge — running models close to users for low-latency real-time AI features. WebSocket support, persistent VMs (not serverless functions), and private networking make Fly a natural fit for AI applications that need predictable latency, persistent state, and global distribution.
💡Key Concept
Edge-GPU positioning: Centralized GPU clouds (Lambda, CoreWeave, hyperscalers) optimize for raw GPU throughput at the lowest cost per hour. Fly.io optimizes for getting AI inference close to end users globally. Trade-off: Fly's GPU pricing is typically higher than centralized clouds, but latency-sensitive workloads (chat interfaces, real-time recommendations, live transcription) benefit substantially from the edge placement. For training, Fly is the wrong tool. For real-time inference at global scale, it's a strong fit.
⚠️Warning
Public GPU pivot history: Fly.io publicly retracted its initial broad GPU hosting strategy in 2024 — its platform's strengths didn't align well with infrastructure-heavy AI workloads like big-batch training and video processing. The current GPU offering is positioned more narrowly around real-time edge inference, which is where Fly's persistent global VM model shines.
✅Tip
Visit Fly.io: fly.io — usage-based pricing; spin up GPU machines via fly machine run or the dashboard
Pricing
Fly.io uses usage-based pricing with no minimum commitments. GPU pricing is published on the live pricing page (subject to change — verify current rates before deploying).
A10
- Smaller LLMs (Llama 3 8B at FP16)
- Stable Diffusion
- Cost-effective edge inference
L40S
- 48GB GPU RAM
- All-rounder for inference and graphics workloads
- Workhorse for production AI features
A100 40GB
- Inference + training + scientific simulation
- Mid-tier for memory-bound workloads
- Available across 35 regions
A100 80GB
- Largest memory in Fly's GPU lineup
- Long-context inference
- Higher-throughput training-style workloads
Auto-Suspension
- Suspend/resume in seconds
- Networked storage attaches across stops
- Ideal for unpredictable inference traffic
Fly.io GPU Machines are billed per second — meaningful cost discipline for traffic that isn't constantly hammering the GPU. Suspended machines don't bill GPU hours.
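The per-second arithmetic can be sketched in a few lines. This is a minimal illustration, not a pricing calculator: the hourly rate and the three active hours per day are placeholder assumptions, and the sketch ignores any non-GPU charges.

```python
def gpu_cost(active_seconds: int, hourly_rate_usd: float) -> float:
    """Cost of GPU time billed per second at a given hourly rate."""
    return active_seconds * hourly_rate_usd / 3600


# Hypothetical figures for illustration only; check the live pricing page.
HOURLY_RATE_USD = 2.00               # placeholder, not a published Fly.io price
ACTIVE_SECONDS_PER_DAY = 3 * 3600    # assume the GPU serves traffic 3 hours a day

suspended_when_idle = gpu_cost(ACTIVE_SECONDS_PER_DAY, HOURLY_RATE_USD)
always_on = gpu_cost(24 * 3600, HOURLY_RATE_USD)

print(f"suspended when idle: ${suspended_when_idle:.2f}/day")
print(f"always on:           ${always_on:.2f}/day")
```

Under these assumed numbers, the machine that suspends when idle costs one eighth of the always-on one, which is simply the ratio of active hours to 24.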
Core Capabilities
35 Global Regions
Deploy the same GPU Machine config to any of 35 regions worldwide. Requests are routed to the nearest region where the app is running, so a chat user in Tokyo reaches a Tokyo GPU machine rather than one in Virginia. The latency advantage compounds for real-time AI features.
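The idea of routing a user to the closest region can be modeled with a great-circle distance calculation. This is only a sketch of the concept: the region names and coordinates below are illustrative assumptions, and the platform's actual routing is handled by its own proxy and need not be purely geographic.

```python
from math import asin, cos, radians, sin, sqrt

# Approximate (latitude, longitude) per region; names are illustrative labels.
REGIONS = {
    "tokyo": (35.68, 139.69),
    "virginia": (38.95, -77.45),
    "frankfurt": (50.11, 8.68),
}


def haversine_km(a, b):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(h))


def nearest_region(user, regions=REGIONS):
    """Name of the region geographically closest to the user."""
    return min(regions, key=lambda name: haversine_km(user, regions[name]))


print(nearest_region((34.69, 135.50)))  # a user in Osaka -> tokyo
```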
Four GPU Tiers
- A10: entry-level inference; runs smaller LLMs (Llama 3 8B at FP16), Stable Diffusion, and other small diffusion models
- L40S: 48GB VRAM all-rounder for inference and graphics; workhorse for production AI features
- A100 40GB (PCIe): mid-tier option for memory-bound inference and training-style workloads
- A100 80GB (SXM): largest memory in the lineup; suited to long-context inference and higher-throughput training-style workloads
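A rough way to match a model to a tier is to estimate the memory its weights need and pick the smallest GPU that fits. The sketch below assumes FP16 weights (2 bytes per parameter) and a 20% headroom factor for KV cache and activations. That factor is an illustrative assumption, not a measured value. The A10's 24 GB figure is NVIDIA's published spec; the other sizes come from the tier list above.

```python
# (tier name, VRAM in GB)
TIERS = [("A10", 24), ("L40S", 48), ("A100 40GB", 40), ("A100 80GB", 80)]


def weights_gb(params_billions: float, bytes_per_param: int = 2) -> float:
    """Memory for model weights alone; FP16 uses 2 bytes per parameter."""
    # (params_billions * 1e9 params) * bytes_per_param / 1e9 bytes per GB
    return params_billions * bytes_per_param


def smallest_fit(params_billions: float, headroom: float = 1.2):
    """Smallest tier whose VRAM covers weights plus assumed headroom, else None."""
    need = weights_gb(params_billions) * headroom
    fits = [(vram, name) for name, vram in TIERS if vram >= need]
    return min(fits)[1] if fits else None


print(weights_gb(8))     # Llama 3 8B at FP16: 16 GB of weights
print(smallest_fit(8))   # fits on the A10 with headroom
print(smallest_fit(70))  # None: too large for any single GPU listed here
```

Real memory use depends on context length, batch size, and the serving stack, so treat the result as a first filter before benchmarking.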
Persistent VMs (Not Serverless Functions)
Unlike serverless platforms (Cloudflare Workers AI, AWS Lambda), Fly Machines are persistent VMs: they hold state across requests, support WebSockets and other long-lived connections, and can have local storage attached. This matters for AI applications with conversation state, streaming responses, or persistent connections.
WebSocket Support
First-class WebSocket and HTTP/2 support — critical for streaming LLM responses, real-time voice transcription, and live AI assistants. Many serverless GPU services don't support WebSockets cleanly.
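The combination of persistent state and token streaming can be sketched with the standard library alone. Everything here is a stand-in: `fake_model` is a hypothetical placeholder for a real model server, and `send` represents whatever a WebSocket library would provide for pushing a message to the client.

```python
import asyncio
from collections import defaultdict

# In-process conversation history keyed by session ID. It survives between
# requests only because the VM, and this process, stay alive.
SESSIONS = defaultdict(list)


async def fake_model(prompt):
    """Hypothetical stand-in for a model server: yields one token at a time."""
    for token in f"echo: {prompt}".split():
        await asyncio.sleep(0)  # yield control, as a real decode step would
        yield token


async def handle_message(session_id, prompt, send):
    """Record the turn, then stream each token to the client as it is produced."""
    SESSIONS[session_id].append(prompt)
    async for token in fake_model(prompt):
        await send(token)  # with a WebSocket library, this would push a frame


async def demo():
    received = []

    async def send(token):
        received.append(token)

    await handle_message("s1", "hello there", send)
    await handle_message("s1", "second turn", send)
    return received


tokens = asyncio.run(demo())
print(tokens)
print(SESSIONS["s1"])
```

Keeping history in process memory is the simplest design a persistent VM allows. It is lost when the process restarts, so durable conversations would still need external storage.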
Private Networking
Fly Machines can communicate over private 6PN IPv6 mesh networks — useful for splitting AI applications into front-end VMs (CPU) and inference VMs (GPU) without exposing the GPU machines to the public internet.
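As a sketch of the front-end/inference split, a CPU machine can call the GPU app by its internal hostname. Fly's internal DNS resolves names of the form `<app>.internal` on the private network. The app name `my-gpu-inference`, the port, and the `/generate` endpoint below are hypothetical and depend on how the model server is set up.

```python
import json
import urllib.request


def internal_url(app, port, path="/", region=None):
    """Build a private-network URL for another app, optionally pinned to a region."""
    host = f"{region}.{app}.internal" if region else f"{app}.internal"
    return f"http://{host}:{port}{path}"


def infer(prompt, app="my-gpu-inference", port=8000):
    """POST a prompt to the GPU app. Only reachable from inside the private network.

    The app name, port, and /generate path are assumptions for illustration.
    """
    req = urllib.request.Request(
        internal_url(app, port, "/generate"),
        data=json.dumps({"prompt": prompt}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


print(internal_url("my-gpu-inference", 8000, "/generate"))
```

Because the GPU app is addressed only by its internal name, it needs no public service definition at all.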
Per-Second Billing
GPU usage is billed per second, and machines can be suspended when not in active use. For the unpredictable traffic patterns common in real-time AI features, this avoids paying for idle GPU time the way an always-on instance would.
Auto-Suspension
Idle machines can auto-suspend, then auto-resume on incoming requests. Resuming takes seconds, not minutes, which is much faster than launching a fresh hyperscaler instance.
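Since the first request after an idle period may be slow or may fail while a machine resumes, a client can wrap calls in a short retry loop. This is a defensive client-side pattern under that assumption, not a description of how the platform handles resumption; the function name and defaults are illustrative.

```python
import time


def call_with_resume_retry(request_fn, attempts=4, first_delay=0.5, sleep=time.sleep):
    """Call request_fn, retrying with exponential backoff on connection errors."""
    delay = first_delay
    for attempt in range(1, attempts + 1):
        try:
            return request_fn()
        except (ConnectionError, TimeoutError):
            if attempt == attempts:
                raise  # out of attempts: surface the error to the caller
            sleep(delay)
            delay *= 2


# Demo: a fake request that fails twice, as if the machine were still resuming.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("machine still resuming")
    return "ok"


delays = []
result = call_with_resume_retry(flaky, sleep=delays.append)  # record instead of sleeping
print(result, delays)
```

Passing `sleep` as a parameter keeps the helper testable without real waiting.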
Strengths
- Global edge placement: 35 regions; latency follows the user, not a single AWS region
- Per-second billing: Cost discipline for unpredictable traffic patterns
- Persistent VMs: WebSocket, streaming, and persistent connections work cleanly — unlike serverless GPU offerings
- Auto-suspend / auto-resume: Idle machines don't bill GPU; resume in seconds on incoming traffic
- Private networking: Compose AI applications across CPU + GPU machines without public exposure
- Developer-friendly tooling: fly deploy is dramatically simpler than hyperscaler GPU instance setup
Limitations & Considerations
- Not for centralized training: Fly publicly stepped back from broad GPU hosting in 2024 — current offering is narrowly suited to edge inference, not big-batch training
- GPU pricing higher than centralized clouds: Per-GPU-hour rates typically exceed those of Lambda Cloud, CoreWeave, and hyperscaler GPU instances; the premium pays for edge placement
- Smaller GPU lineup: A10 / L40S / A100 — no current H100, H200, or B200 access
- Capacity constraints: GPU availability varies by region; less-trafficked regions may have limited GPU capacity
- Recent strategic pivot: Fly's GPU offering scope tightened in 2024 — long-term roadmap clarity matters for production commitments
- No managed AI services: No AWS Bedrock or Azure OpenAI equivalent — bring your own model serving (vLLM, Triton, custom)
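The pricing trade-off in the list above can be framed as a break-even utilization: the fraction of time an edge GPU can be active before it costs more than a cheaper GPU that runs continuously. Both hourly rates below are hypothetical placeholders, and the sketch assumes the centralized GPU is billed around the clock while the edge GPU bills only when active.

```python
def break_even_utilization(edge_rate: float, central_rate: float) -> float:
    """Active fraction at which a suspend-when-idle edge GPU costs the same
    as an always-on centralized GPU with a lower hourly rate."""
    return central_rate / edge_rate


# Hypothetical hourly rates for illustration only.
EDGE_RATE_USD = 2.50
CENTRAL_RATE_USD = 1.50

threshold = break_even_utilization(EDGE_RATE_USD, CENTRAL_RATE_USD)
print(f"edge is cheaper below {threshold:.0%} utilization")
```

Cost is only one side of the comparison; the latency benefit of edge placement does not appear in this number.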
Best Use Cases
| Use Case | Why Fly.io GPU Machines Fit | Caveat |
|---|---|---|
| Real-time AI chat and assistants | Persistent VMs + WebSockets + global edge | Higher per-hour rates than centralized clouds |
| Live transcription and voice AI | WebSocket support + edge latency | Validate GPU availability in target regions |
| Per-region inference for compliance | 35 regions enable in-region data residency | Each region's setup needs its own compliance audit |
| Streaming LLM responses to global users | Auto-suspend reduces idle cost | Higher rate per active hour vs Lambda Cloud |
| AI features inside web/mobile apps | Tight Fly.io platform integration; simple deploy workflow | For training, use centralized GPU clouds |
When to choose alternatives:
- Centralized AI training → Lambda Cloud, CoreWeave, or hyperscaler GPU instances
- Largest-scale inference at lowest per-hour cost → Lambda Cloud, CoreWeave, hyperscaler bare metal
- Serverless GPU at edge with no infrastructure management → Cloudflare Workers AI for open-source models
- Frontier closed models (GPT, Claude, Gemini) → use the model providers' hosted APIs directly
Key Takeaways
- Fly.io GPU Machines are persistent global edge VMs with NVIDIA A10, L40S, and A100 (40GB/80GB) GPUs across 35 regions
- Strategic positioning: real-time AI inference at the edge with per-second billing and persistent VMs (WebSockets, streaming, persistent state)
- Public 2024 strategic retraction: Fly's GPU offering tightened from broad GPU hosting to edge-inference focus — match workload to current scope
- Best fit for real-time AI chat, live transcription, streaming LLM responses, and compliance-driven per-region inference
- For centralized GPU training or lowest per-hour cost at scale, use Lambda Cloud, CoreWeave, or hyperscaler alternatives