Name: Fly.io GPU Machines
Author: Fly.io

Learning Objectives

Understand Fly.io's edge-GPU positioning and how it differs from centralized AI clouds
Identify the available GPU types and their workload fit
Evaluate when edge-GPU inference makes sense vs. centralized AI clouds

What Is Fly.io GPU Machines?

Fly.io is a developer-focused cloud platform for running containerized applications close to users — its core product is Fly Machines, persistent VMs that can be deployed to any of 35 global regions in seconds. Fly GPU Machines extend this to AI workloads by adding NVIDIA GPUs to the Fly Machine fleet.

The strategic positioning: rather than competing with hyperscalers for centralized GPU training, Fly.io focuses on AI inference at the edge — running models close to users for low-latency real-time AI features. WebSocket support, persistent VMs (not serverless functions), and private networking make Fly a natural fit for AI applications that need predictable latency, persistent state, and global distribution.

💡Key Concept

Edge-GPU positioning: Centralized GPU clouds (Lambda, CoreWeave, hyperscalers) optimize for raw GPU throughput at the lowest cost per hour. Fly.io optimizes for getting AI inference close to end users globally. Trade-off: Fly's GPU pricing is typically higher than centralized clouds, but latency-sensitive workloads (chat interfaces, real-time recommendations, live transcription) benefit substantially from the edge placement. For training, Fly is the wrong tool. For real-time inference at global scale, it's a strong fit.

⚠️Warning

Public GPU pivot history: Fly.io publicly retracted its initial broad GPU hosting strategy in 2024 — its platform's strengths didn't align well with infrastructure-heavy AI workloads like big-batch training and video processing. The current GPU offering is positioned more narrowly around real-time edge inference, which is where Fly's persistent global VM model shines.

✅Tip

Visit Fly.io: fly.io — usage-based pricing; spin up GPU machines via fly machine run or the dashboard

Pricing

Fly.io uses usage-based pricing with no minimum commitments. GPU pricing is published on the live pricing page (subject to change — verify current rates before deploying).

Option	Billing	Features
A10 GPUs	Per-second billing	Smaller LLMs (Llama 3 8B at FP16); Stable Diffusion; cost-effective edge inference
L40S GPUs	Per-second billing	48GB GPU RAM; all-rounder for inference and graphics workloads; workhorse for production AI features
A100 PCIe 40GB	Per-second billing	Inference, training, and scientific simulation; mid-tier for memory-bound workloads; available across 35 regions
A100 SXM 80GB	Higher per-second rate	Largest memory in Fly's GPU lineup; long-context inference; higher-throughput training-style workloads
Persistent VM model	Pay only when machine is running	Suspend/resume in seconds; networked storage attaches across stops; ideal for unpredictable inference traffic

Fly.io GPU Machines are billed per second — meaningful cost discipline for traffic that isn't constantly hammering the GPU. Suspended machines don't bill GPU hours.
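To see how much the "pay only while running" model can matter, here is a minimal sketch that estimates monthly GPU spend from the number of seconds a machine is actually running each day. The hourly rate is a made-up placeholder, not a published Fly.io price, and the function ignores non-GPU charges such as storage.

```python
SECONDS_PER_HOUR = 3600
SECONDS_PER_DAY = 24 * SECONDS_PER_HOUR


def monthly_gpu_cost(hourly_rate_usd: float,
                     active_seconds_per_day: float,
                     days: int = 30) -> float:
    """Estimate GPU spend when billing accrues only while the machine runs."""
    if not 0 <= active_seconds_per_day <= SECONDS_PER_DAY:
        raise ValueError("active_seconds_per_day must fit within one day")
    per_second_rate = hourly_rate_usd / SECONDS_PER_HOUR
    return per_second_rate * active_seconds_per_day * days


# Hypothetical rate for illustration only; check the live pricing page.
EXAMPLE_HOURLY_RATE_USD = 2.00

always_on = monthly_gpu_cost(EXAMPLE_HOURLY_RATE_USD, SECONDS_PER_DAY)
six_hours_a_day = monthly_gpu_cost(EXAMPLE_HOURLY_RATE_USD, 6 * SECONDS_PER_HOUR)

print(f"always on:       ${always_on:,.2f}")
print(f"6 active h/day:  ${six_hours_a_day:,.2f}")
```

Under these assumed numbers, a machine that is active a quarter of the day costs a quarter of an always-on machine. The real saving depends on how quickly your workload can tolerate suspend and resume.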

Core Capabilities

35 Global Regions

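Deploy the same GPU Machine config to any of 35 regions worldwide, subject to GPU capacity in that region. Requests are routed to the nearest region with a running Machine, so a chat user in Tokyo is served by the Tokyo GPU machine, not a Virginia one. The saving recurs on every round trip, which adds up for real-time AI features that exchange many messages per session.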
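The Tokyo versus Virginia example can be sanity-checked with a rough lower bound on network round-trip time. The sketch below assumes signals travel through fiber at about 200,000 km/s and follow the great-circle path; real routes are longer and add queuing and processing delay, so treat the result as a floor, not a measurement. The city coordinates are approximate.

```python
import math

EARTH_RADIUS_KM = 6371.0
FIBER_SPEED_KM_PER_S = 200_000.0  # roughly two thirds of c; an approximation


def great_circle_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Haversine distance between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def min_round_trip_ms(distance_km: float) -> float:
    """Best-case round trip: out and back at fiber speed, nothing else."""
    return 2 * distance_km / FIBER_SPEED_KM_PER_S * 1000


TOKYO = (35.68, 139.69)            # approximate coordinates
NORTHERN_VIRGINIA = (39.04, -77.49)  # approximate coordinates

distance = great_circle_km(*TOKYO, *NORTHERN_VIRGINIA)
rtt_far = min_round_trip_ms(distance)
rtt_near = min_round_trip_ms(50)   # a user about 50 km from the serving region

print(f"Tokyo to Virginia: {distance:,.0f} km, at least {rtt_far:.0f} ms per round trip")
print(f"In-region (50 km): at least {rtt_near:.1f} ms per round trip")
```

Even in this best case, the cross-Pacific path costs on the order of a hundred milliseconds per round trip, which is why the placement matters for interactive features that make many round trips per session.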

Four GPU Tiers

A10: entry-level inference; runs smaller LLMs (Llama 3 8B at FP16), Stable Diffusion, and other small diffusion models
L40S: 48GB VRAM all-rounder for inference and graphics; workhorse for production AI features
A100 PCIe 40GB: mid-tier inference and training-style workloads; suited to memory-bound jobs
A100 SXM 80GB: largest memory in the lineup; long-context inference and higher-throughput training-style workloads
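A quick way to match a model to a tier is to estimate the memory its weights need. The sketch below multiplies parameter count by bytes per parameter and adds a flat overhead margin for KV cache and activations. The 20% margin is an assumed rule of thumb, not a Fly.io figure, and the A10's 24 GB comes from NVIDIA's published spec rather than from this lesson.

```python
# VRAM per tier in GB. L40S and A100 sizes are from the tiers above;
# the A10 figure (24 GB) is NVIDIA's published spec.
GPU_VRAM_GB = {
    "A10": 24,
    "L40S": 48,
    "A100 40GB": 40,
    "A100 80GB": 80,
}


def weight_memory_gb(params_billions: float, bytes_per_param: float = 2.0) -> float:
    """Memory for weights alone: 1e9 params * bytes each / 1e9 bytes per GB."""
    return params_billions * bytes_per_param


def gpus_that_fit(params_billions: float,
                  bytes_per_param: float = 2.0,
                  overhead_fraction: float = 0.2) -> list[str]:
    """Tiers with enough VRAM for weights plus an assumed overhead margin,
    smallest first. overhead_fraction is a hypothetical heuristic."""
    needed = weight_memory_gb(params_billions, bytes_per_param) * (1 + overhead_fraction)
    by_size = sorted(GPU_VRAM_GB.items(), key=lambda item: item[1])
    return [name for name, vram in by_size if vram >= needed]


print(gpus_that_fit(8))                        # 8B parameters at FP16 (2 bytes each)
print(gpus_that_fit(70))                       # 70B parameters at FP16
print(gpus_that_fit(70, bytes_per_param=0.5))  # 70B parameters at 4-bit quantization
```

An 8B model at FP16 needs about 16 GB for weights, which is why it fits the A10, while a 70B model at FP16 exceeds every single GPU in the lineup unless it is quantized or sharded.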

Persistent VMs (Not Serverless Functions)

Unlike serverless inference platforms (Cloudflare Workers AI, AWS Lambda), Fly Machines are persistent VMs — they hold state across requests, support WebSockets, run long-running connections, and can have local storage attached. Important for AI applications with conversation state, streaming responses, or persistent connections.

WebSocket Support

First-class WebSocket and HTTP/2 support — critical for streaming LLM responses, real-time voice transcription, and live AI assistants. Many serverless GPU services don't support WebSockets cleanly.
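The streaming pattern this enables looks roughly like the sketch below: the server pushes each token to the client as soon as the model produces it, over one long-lived connection. This uses only the standard library, `generate_tokens` is a hypothetical stand-in for a real model server, and `send` stands in for whatever a WebSocket library exposes for writing a message.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable


async def generate_tokens(prompt: str) -> AsyncIterator[str]:
    """Hypothetical model stub: yields one token at a time."""
    for word in f"echo: {prompt}".split():
        await asyncio.sleep(0)  # hand control back to the event loop between tokens
        yield word


async def stream_reply(prompt: str,
                       send: Callable[[str], Awaitable[None]]) -> int:
    """Forward each token to the client as soon as it is produced.
    Returns the number of tokens sent."""
    sent = 0
    async for token in generate_tokens(prompt):
        await send(token)
        sent += 1
    return sent


async def _demo() -> tuple[int, list[str]]:
    received: list[str] = []

    async def fake_send(chunk: str) -> None:
        # Stand-in for writing a message to an open WebSocket.
        received.append(chunk)

    count = await stream_reply("hello edge", fake_send)
    return count, received


demo_count, demo_chunks = asyncio.run(_demo())
print(demo_count, demo_chunks)
```

Because the connection stays open for the whole reply, the user sees the first token after one model step instead of waiting for the full completion.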

Private Networking

Fly Machines can communicate over private 6PN IPv6 mesh networks — useful for splitting AI applications into front-end VMs (CPU) and inference VMs (GPU) without exposing the GPU machines to the public internet.
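As a sketch of how a CPU front end might address a GPU back end privately, the helper below builds the internal DNS name for an app and checks that a resolved address looks like a private 6PN address. The app name `gpu-inference` is hypothetical, and the `fdaa::/16` prefix is an assumption about how 6PN addresses are allocated; confirm both against Fly.io's private networking documentation.

```python
import ipaddress
from typing import Optional

# Assumption: 6PN addresses are allocated from this IPv6 prefix.
SIX_PN_PREFIX = ipaddress.ip_network("fdaa::/16")


def internal_hostname(app_name: str, region: Optional[str] = None) -> str:
    """Internal DNS name for an app, optionally narrowed to one region."""
    if region:
        return f"{region}.{app_name}.internal"
    return f"{app_name}.internal"


def is_private_6pn(address: str) -> bool:
    """True if the address falls inside the assumed 6PN prefix."""
    return ipaddress.ip_address(address) in SIX_PN_PREFIX


# Hypothetical app name for the GPU inference tier.
print(internal_hostname("gpu-inference"))
print(internal_hostname("gpu-inference", region="nrt"))
print(is_private_6pn("fdaa:0:1::3"))
print(is_private_6pn("2001:db8::1"))
```

A front end that only ever connects to names like these never needs the GPU machines to have a public address, which is the isolation the paragraph above describes.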

Per-Second Billing

GPU usage is billed per second, and machines can be suspended when not in active use. For the unpredictable traffic patterns common in real-time AI features, per-second billing avoids paying for whole hours of capacity that a short burst only partly uses, which is the waste that coarser hourly billing creates.
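To make the granularity difference concrete, the sketch below prices the same bursty workload two ways: per second, and with each burst rounded up to a whole hour. It assumes every burst starts a fresh billing period, and the rate and burst lengths are invented for illustration.

```python
import math
from typing import Iterable

SECONDS_PER_HOUR = 3600


def per_second_cost(hourly_rate_usd: float, burst_seconds: Iterable[float]) -> float:
    """Bill exactly the seconds used."""
    return sum(burst_seconds) * hourly_rate_usd / SECONDS_PER_HOUR


def hourly_rounded_cost(hourly_rate_usd: float, burst_seconds: Iterable[float]) -> float:
    """Bill each burst rounded up to the next whole hour (assumed model)."""
    hours = sum(math.ceil(s / SECONDS_PER_HOUR) for s in burst_seconds)
    return hours * hourly_rate_usd


# Hypothetical workload: three short bursts of GPU activity, in seconds.
bursts = [90, 300, 45]
RATE_USD = 2.00  # placeholder rate, not a published price

print(f"per-second: ${per_second_cost(RATE_USD, bursts):.2f}")
print(f"hourly:     ${hourly_rounded_cost(RATE_USD, bursts):.2f}")
```

The gap narrows as bursts get longer or closer together, so the benefit is largest for exactly the spiky, short-lived traffic described above.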

Auto-Suspension

Idle machines can auto-suspend, then auto-resume on incoming requests. Cold starts are seconds, not minutes — much faster than spinning up fresh hyperscaler instances.

Strengths

Global edge placement: 35 regions; latency follows the user, not a single AWS region
Per-second billing: Cost discipline for unpredictable traffic patterns
Persistent VMs: WebSocket, streaming, and persistent connections work cleanly — unlike serverless GPU offerings
Auto-suspend / auto-resume: Idle machines don't bill GPU; resume in seconds on incoming traffic
Private networking: Compose AI applications across CPU + GPU machines without public exposure
Developer-friendly tooling: fly deploy is dramatically simpler than hyperscaler GPU instance setup

Limitations & Considerations

Not for centralized training: Fly publicly stepped back from broad GPU hosting in 2024 — current offering is narrowly suited to edge inference, not big-batch training
GPU pricing higher than centralized clouds: Per-GPU-hour rates typically exceed those of Lambda Cloud, CoreWeave, and hyperscaler GPU instances; the premium pays for edge placement
Smaller GPU lineup: A10 / L40S / A100 — no current H100, H200, or B200 access
Capacity constraints: GPU availability varies by region; less-trafficked regions may have limited GPU capacity
Recent strategic pivot: Fly's GPU offering scope tightened in 2024 — long-term roadmap clarity matters for production commitments
No managed AI services: No AWS Bedrock or Azure OpenAI equivalent — bring your own model serving (vLLM, Triton, custom)

Best Use Cases

Use Case	Why Fly.io GPU Machines Fit	Caveat
Real-time AI chat and assistants	Persistent VMs + WebSockets + global edge	Higher per-hour rates than centralized clouds
Live transcription and voice AI	WebSocket support + edge latency	Validate GPU availability in target regions
Per-region inference for compliance	35 regions enable in-region data residency	Audit each region's setup for compliance
Streaming LLM responses to global users	Auto-suspend reduces idle cost	Higher rate per active hour vs Lambda Cloud
AI features inside web/mobile apps	Tight Fly.io platform integration; simple deploy workflow	For training, use centralized GPU clouds

When to choose alternatives:

Centralized AI training → Lambda Cloud, CoreWeave, or hyperscaler GPU instances
Largest-scale inference at lowest per-hour cost → Lambda Cloud, CoreWeave, hyperscaler bare metal
Serverless GPU at edge with no infrastructure management → Cloudflare Workers AI for open-source models
Frontier closed models (GPT, Claude, Gemini) → use the model providers' hosted APIs directly

Key Takeaways

Fly.io GPU Machines are persistent global edge VMs with NVIDIA A10, L40S, and A100 (40GB/80GB) GPUs across 35 regions
Strategic positioning: real-time AI inference at the edge with per-second billing and persistent VMs (WebSockets, streaming, persistent state)
Public 2024 strategic retraction: Fly's GPU offering tightened from broad GPU hosting to edge-inference focus — match workload to current scope
Best fit for real-time AI chat, live transcription, streaming LLM responses, and compliance-driven per-region inference
For centralized GPU training or lowest per-hour cost at scale, use Lambda Cloud, CoreWeave, or hyperscaler alternatives
