6 min read · Updated April 29, 2026

Fly.io GPU Machines

By Fly.io

Fly.io GPU Machines are persistent VMs with NVIDIA A10, L40S, and A100 GPUs, running on a platform that spans 35 regions. They are designed for low-latency AI inference close to users rather than for centralized training.

Learning Objectives

  • Understand Fly.io's edge-GPU positioning and how it differs from centralized AI clouds
  • Identify the available GPU types and their workload fit
  • Evaluate when edge-GPU inference makes sense vs. centralized AI clouds

What Are Fly.io GPU Machines?

Fly.io is a developer-focused cloud platform for running containerized applications close to users. Its core product is Fly Machines: persistent VMs that can be deployed to any of 35 global regions in seconds. Fly GPU Machines extend this model to AI workloads by adding NVIDIA GPUs to the Fly Machine fleet.

The strategic positioning: rather than competing with hyperscalers for centralized GPU training, Fly.io focuses on AI inference at the edge — running models close to users for low-latency real-time AI features. WebSocket support, persistent VMs (not serverless functions), and private networking make Fly a natural fit for AI applications that need predictable latency, persistent state, and global distribution.

💡Key Concept

Edge-GPU positioning: Centralized GPU clouds (Lambda, CoreWeave, hyperscalers) optimize for raw GPU throughput at the lowest cost per hour. Fly.io optimizes for getting AI inference close to end users globally. Trade-off: Fly's GPU pricing is typically higher than centralized clouds, but latency-sensitive workloads (chat interfaces, real-time recommendations, live transcription) benefit substantially from the edge placement. For training, Fly is the wrong tool. For real-time inference at global scale, it's a strong fit.

⚠️Warning

Public GPU pivot history: Fly.io publicly retracted its initial broad GPU hosting strategy in 2024 — its platform's strengths didn't align well with infrastructure-heavy AI workloads like big-batch training and video processing. The current GPU offering is positioned more narrowly around real-time edge inference, which is where Fly's persistent global VM model shines.

Tip

Visit Fly.io: fly.io — usage-based pricing; spin up GPU machines via fly machine run or the dashboard

Pricing

Fly.io uses usage-based pricing with no minimum commitments. GPU pricing is published on the live pricing page (subject to change — verify current rates before deploying).

A10 GPUs (per-second billing)
  • Smaller LLMs (Llama 3 8B at FP16)
  • Stable Diffusion
  • Cost-effective edge inference
L40S GPUs (per-second billing)
  • 48GB GPU RAM
  • All-rounder for inference and graphics workloads
  • Workhorse for production AI features
A100 PCIe 40GB (per-second billing)
  • Inference, training, and scientific simulation
  • Mid-tier for memory-bound workloads
  • Regional availability varies; check before deploying
A100 SXM 80GB (higher per-second rate)
  • Largest memory in Fly's GPU lineup
  • Long-context inference
  • Higher-throughput training-style workloads
Persistent VM model (pay only while the Machine is running)
  • Suspend/resume in seconds
  • Attached volumes persist across stops
  • Suited to unpredictable inference traffic

Fly.io GPU Machines are billed per second, which keeps costs down for traffic that does not keep the GPU busy around the clock. Suspended machines do not accrue GPU charges.
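The per-second arithmetic can be sketched in a few lines of Python. The hourly rate below is a made-up placeholder, not a Fly.io price; substitute the current figure from the live pricing page before drawing conclusions.

```python
SECONDS_PER_HOUR = 3600


def gpu_cost(active_seconds: float, hourly_rate_usd: float) -> float:
    """Cost of a GPU Machine that is billed per second of running time."""
    return active_seconds * hourly_rate_usd / SECONDS_PER_HOUR


# HYPOTHETICAL rate for illustration only; not a published Fly.io price.
EXAMPLE_RATE_USD_PER_HOUR = 2.50

# A Machine that runs 90 minutes per day versus one left running all day.
bursty_daily = gpu_cost(90 * 60, EXAMPLE_RATE_USD_PER_HOUR)
always_on_daily = gpu_cost(24 * SECONDS_PER_HOUR, EXAMPLE_RATE_USD_PER_HOUR)

print(f"bursty: ${bursty_daily:.2f}/day, always on: ${always_on_daily:.2f}/day")
```

With these assumed numbers, the gap between the two figures is the amount that suspending an idle Machine would save per day.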

Core Capabilities

35 Global Regions

Deploy the same GPU Machine config across Fly's 35 regions worldwide, subject to GPU capacity in each. Requests are routed to the nearest running Machine, so a chat user in Tokyo is served by a Tokyo GPU Machine rather than one in Virginia. The latency advantage compounds for real-time AI features because a single interaction often involves several round trips.
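To put a rough number on the Tokyo versus Virginia comparison, the sketch below estimates the physical lower bound on round-trip time from great-circle distance. The coordinates are approximate, and the fibre speed of about 200 km per millisecond (roughly two thirds of the speed of light) is a common rule of thumb; real network paths are longer and add routing and queuing delay.

```python
import math

EARTH_RADIUS_KM = 6371.0
FIBRE_KM_PER_MS = 200.0  # rule of thumb: light in fibre covers ~200 km per ms


def great_circle_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Haversine distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def min_rtt_ms(distance_km: float) -> float:
    """Best-case round trip over fibre, ignoring routing and processing."""
    return 2 * distance_km / FIBRE_KM_PER_MS


# Approximate coordinates for Tokyo and Ashburn, Virginia.
distance = great_circle_km(35.68, 139.69, 39.04, -77.49)
print(f"{distance:.0f} km, at least {min_rtt_ms(distance):.0f} ms round trip")
```

A Machine in the user's own region removes most of this fixed cost from every round trip, which is the core of the edge argument.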

Four GPU Tiers

  • A10: entry-level inference; runs smaller LLMs (Llama 3 8B at FP16), Stable Diffusion, and smaller diffusion models
  • L40S: 48GB VRAM all-rounder for inference and graphics; workhorse for production AI features
  • A100 PCIe 40GB: mid-tier inference and training-style workloads that are memory-bound
  • A100 SXM 80GB: largest memory in the lineup; long-context inference and higher-throughput training-style workloads
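A quick way to reason about workload fit is to compare a model's weight footprint with each tier's memory. The sketch below is a back-of-the-envelope check only: the 24GB figure for the A10 comes from NVIDIA's published spec rather than from this lesson, and the 1.2x overhead factor for KV cache and activations is an assumption that varies widely with context length and batch size.

```python
from typing import Optional

# GPU memory per tier in GB. The A10 figure is NVIDIA's spec, not from this lesson.
GPU_MEMORY_GB = {"A10": 24, "A100 40GB": 40, "L40S": 48, "A100 80GB": 80}


def smallest_fit(params_billion: float, bytes_per_param: float,
                 overhead: float = 1.2) -> Optional[str]:
    """Return the smallest tier whose memory covers weights plus overhead.

    `overhead` is an ASSUMED multiplier for KV cache and activations.
    """
    needed_gb = params_billion * bytes_per_param * overhead
    for name, memory_gb in sorted(GPU_MEMORY_GB.items(), key=lambda kv: kv[1]):
        if needed_gb <= memory_gb:
            return name
    return None  # does not fit on a single GPU in this lineup


print(smallest_fit(8, 2))     # 8B parameters at FP16 (2 bytes each)
print(smallest_fit(70, 0.5))  # 70B parameters at 4-bit (0.5 bytes each)
print(smallest_fit(70, 2))    # 70B parameters at FP16
```

Under these assumptions an 8B model at FP16 needs about 19GB and lands on the A10, which matches the lesson's example, while a 70B model at FP16 exceeds every single-GPU tier listed.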

Persistent VMs (Not Serverless Functions)

Unlike serverless platforms (for example Cloudflare Workers AI or AWS Lambda), Fly Machines are persistent VMs: they hold state across requests, support WebSockets and other long-lived connections, and can have local storage attached. This matters for AI applications with conversation state, streaming responses, or persistent connections.

WebSocket Support

Fly provides first-class WebSocket and HTTP/2 support, which is critical for streaming LLM responses, real-time voice transcription, and live AI assistants. Many serverless GPU services do not support WebSockets cleanly.
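The combination of persistent state and token streaming can be illustrated with standard-library Python alone. In this sketch `fake_model` is a hypothetical stand-in for a real model server, `HISTORY` is plain in-process memory, and the point where a real application would write to a WebSocket is marked with a comment; none of these names come from Fly.io.

```python
import asyncio
from collections import defaultdict
from typing import AsyncIterator, Dict, List

# In-process conversation memory. It survives between requests only because
# the VM (and this process) keeps running, unlike a per-request function.
HISTORY: Dict[str, List[str]] = defaultdict(list)


async def fake_model(prompt: str) -> AsyncIterator[str]:
    """HYPOTHETICAL model stub that yields one token at a time."""
    for token in f"echo: {prompt}".split():
        await asyncio.sleep(0)  # hand control back to the event loop
        yield token


async def stream_reply(session_id: str, prompt: str) -> AsyncIterator[str]:
    """Stream tokens to the caller and record both turns in the history."""
    HISTORY[session_id].append(f"user: {prompt}")
    tokens = []
    async for token in fake_model(prompt):
        tokens.append(token)
        yield token  # a real app would send this over the WebSocket here
    HISTORY[session_id].append("assistant: " + " ".join(tokens))


async def collect(session_id: str, prompt: str) -> List[str]:
    return [token async for token in stream_reply(session_id, prompt)]


first = asyncio.run(collect("s1", "hello there"))
second = asyncio.run(collect("s1", "again"))
print(first, len(HISTORY["s1"]))
```

After the second call the session holds four history entries, which is the behaviour a stateless function would need an external store to reproduce.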

Private Networking

Fly Machines can communicate over private 6PN IPv6 mesh networks — useful for splitting AI applications into front-end VMs (CPU) and inference VMs (GPU) without exposing the GPU machines to the public internet.
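On the private network, Fly apps can reach each other through `.internal` DNS names. The helper below sketches how a CPU front end might build the address of a GPU inference app; the app name `my-inference`, port 8000, and the `/v1/generate` path are hypothetical placeholders, so check the names against your own deployment.

```python
from typing import Optional


def internal_url(app: str, port: int, path: str = "/",
                 region: Optional[str] = None) -> str:
    """Build a private-network URL for another Fly app.

    `<app>.internal` resolves to the app's Machines; prefixing a region code
    narrows the lookup to Machines in that region.
    """
    host = f"{region}.{app}.internal" if region else f"{app}.internal"
    return f"http://{host}:{port}{path}"


# HYPOTHETICAL app name, port, and path, for illustration only.
print(internal_url("my-inference", 8000, "/v1/generate"))
print(internal_url("my-inference", 8000, "/v1/generate", region="nrt"))
```

Because these names resolve only inside the private network, the GPU app needs no public service definition at all.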

Per-Second Billing

GPU usage is billed per second, and machines can be suspended when not in active use. For the unpredictable traffic patterns common in real-time AI features, this means you pay only for the seconds a Machine is actually running instead of provisioning for peak load around the clock.

Auto-Suspension

Idle machines can auto-suspend, then auto-resume on incoming requests. Cold starts are seconds, not minutes — much faster than spinning up fresh hyperscaler instances.
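The effect of an idle timeout on running time can be modelled with a small interval merge. This is a simplified model, not Fly's actual autostop logic: it assumes the Machine starts instantly on each request and stops exactly `idle_timeout` seconds after the last one, and it ignores request duration and resume time.

```python
from typing import Iterable


def running_seconds(request_times: Iterable[float], idle_timeout: float) -> float:
    """Total seconds a Machine is up under a start-on-request, stop-when-idle policy.

    Each request keeps the Machine alive until `idle_timeout` seconds later;
    overlapping keep-alive windows are merged into one running interval.
    """
    total = 0.0
    start = end = None
    for t in sorted(request_times):
        if end is not None and t <= end:
            end = max(end, t + idle_timeout)  # still running: extend the window
        else:
            if end is not None:
                total += end - start  # close the previous running interval
            start, end = t, t + idle_timeout
    if end is not None:
        total += end - start
    return total


# Two requests close together, then one much later, with a 60-second timeout.
print(running_seconds([0, 10, 1000], idle_timeout=60))
```

In this example the Machine runs for 130 seconds out of a span of more than 1,000, which is where the savings from stopping idle Machines come from.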

Strengths

  • Global edge placement: 35 regions; latency follows the user, not a single AWS region
  • Per-second billing: Cost discipline for unpredictable traffic patterns
  • Persistent VMs: WebSocket, streaming, and persistent connections work cleanly — unlike serverless GPU offerings
  • Auto-suspend / auto-resume: Idle machines don't bill GPU; resume in seconds on incoming traffic
  • Private networking: Compose AI applications across CPU + GPU machines without public exposure
  • Developer-friendly tooling: fly deploy is dramatically simpler than hyperscaler GPU instance setup

Limitations & Considerations

  • Not for centralized training: Fly publicly stepped back from broad GPU hosting in 2024 — current offering is narrowly suited to edge inference, not big-batch training
  • GPU pricing higher than centralized clouds: Per-GPU-hour rates typically exceed those of Lambda Cloud, CoreWeave, and hyperscaler GPU instances; the premium pays for edge placement
  • Smaller GPU lineup: A10 / L40S / A100 — no current H100, H200, or B200 access
  • Capacity constraints: GPU availability varies by region; less-trafficked regions may have limited GPU capacity
  • Recent strategic pivot: Fly's GPU offering scope tightened in 2024 — long-term roadmap clarity matters for production commitments
  • No managed AI services: No AWS Bedrock or Azure OpenAI equivalent — bring your own model serving (vLLM, Triton, custom)

Best Use Cases

| Use Case | Why Fly.io GPU Machines Fit | Caveat |
| --- | --- | --- |
| Real-time AI chat and assistants | Persistent VMs + WebSockets + global edge | Higher per-hour rates than centralized clouds |
| Live transcription and voice AI | WebSocket support + edge latency | Validate GPU availability in target regions |
| Per-region inference for compliance | 35 regions enable in-region data residency | Compliance audit per-region setup |
| Streaming LLM responses to global users | Auto-suspend reduces idle cost | Higher rate per active hour vs Lambda Cloud |
| AI features inside web/mobile apps | Tight Fly.io platform integration; simple deploy workflow | For training, use centralized GPU clouds |

When to choose alternatives:

  • Centralized AI training → Lambda Cloud, CoreWeave, or hyperscaler GPU instances
  • Largest-scale inference at lowest per-hour cost → Lambda Cloud, CoreWeave, hyperscaler bare metal
  • Serverless GPU at edge with no infrastructure management → Cloudflare Workers AI for open-source models
  • Frontier closed models (GPT, Claude, Gemini) → use the model providers' hosted APIs directly

Key Takeaways

  • Fly.io GPU Machines are persistent global edge VMs with NVIDIA A10, L40S, and A100 (40GB/80GB) GPUs across 35 regions
  • Strategic positioning: real-time AI inference at the edge with per-second billing and persistent VMs (WebSockets, streaming, persistent state)
  • Public 2024 strategic retraction: Fly's GPU offering tightened from broad GPU hosting to edge-inference focus — match workload to current scope
  • Best fit for real-time AI chat, live transcription, streaming LLM responses, and compliance-driven per-region inference
  • For centralized GPU training or lowest per-hour cost at scale, use Lambda Cloud, CoreWeave, or hyperscaler alternatives
