Learning Objectives
- Understand serverless edge AI inference and where Workers AI fits in the AI compute stack
- Identify the Neuron-based pricing model and how it scales with usage
- Evaluate when Workers AI beats AWS Bedrock, OpenAI API, or self-hosted GPU inference
What Is Cloudflare Workers AI?
Workers AI is Cloudflare's serverless GPU inference platform. It runs open-source AI models at hundreds of global edge locations with no GPU rental, minimal cold-start delay, and usage-based pricing metered in Neurons. It is the AI inference layer of the Cloudflare developer platform, alongside Workers (compute), R2 (storage), D1 (SQL), KV (key-value), and Durable Objects.
Where AWS Bedrock and Azure OpenAI sell access to closed flagship models from a few regions, Workers AI runs open-source models (Llama, Mistral, embeddings, image classification, speech-to-text) on Cloudflare-managed GPUs at 300+ global edge points. The result: latency that depends on how close the user is to a Cloudflare data center (typically under 100ms anywhere in the world) rather than how close they are to a single AWS region.
💡Key Concept
Why edge AI inference: Most AI workloads don't need a frontier flagship model — they need fast, cheap, predictable inference on smaller models (sentiment, classification, embeddings, simple chat, transcription). Edge inference moves the compute close to the user, dropping latency from hundreds of milliseconds (centralized region) to tens of milliseconds (nearby edge). For real-time AI features inside a web or mobile app, this is the difference between snappy and sluggish.
✅Tip
Visit Workers AI: developers.cloudflare.com/workers-ai — free 10,000 Neurons daily; Workers Paid plan unlocks higher limits
Pricing
Workers AI pricing is Neuron-based — Neurons measure GPU compute consumed per request, with each model type having its own Neuron rate. Per-model prices are also published in tokens or images for easier estimation.
Free
- All available models
- Rate-limited by daily Neuron cap
- No credit card required
Workers Paid
- Same model catalog as Free
- Higher concurrency limits
- Single-billing for all Cloudflare developer products
Per-Model Rates
- Up to $2.51 per million images for vision models
- Granular per-model rates published in docs
- Predictable cost scaling
Enterprise
- Higher rate limits
- SLA guarantees
- Dedicated support
The free tier is genuinely usable for prototyping and small projects — 10,000 Neurons per day covers thousands of small-model inferences. Cost discipline matters at scale: for high-volume production traffic, compare Neuron pricing against self-hosted GPU economics.
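The arithmetic behind that comparison can be sketched as follows. This is a rough estimator, not an official calculator: it uses the two figures quoted in this lesson (10,000 free Neurons per day and $0.011 per 1,000 Neurons beyond that), and the `neuronsPerRequest` value is a hypothetical average that you would derive from the published per-model rates.

```typescript
// Figures quoted in this lesson; confirm against current Cloudflare pricing.
const FREE_NEURONS_PER_DAY = 10_000;
const USD_PER_1000_NEURONS = 0.011;

/**
 * Estimate one day's Workers AI charge.
 * `neuronsPerRequest` is a hypothetical per-model average, not a published number.
 */
function estimateDailyCostUsd(requestsPerDay: number, neuronsPerRequest: number): number {
  const totalNeurons = requestsPerDay * neuronsPerRequest;
  // Only usage above the daily free allocation is billed.
  const billableNeurons = Math.max(0, totalNeurons - FREE_NEURONS_PER_DAY);
  return (billableNeurons / 1000) * USD_PER_1000_NEURONS;
}

// Hypothetical workloads, both assuming 2 Neurons per request on average.
const prototypeCost = estimateDailyCostUsd(3_000, 2);       // 6,000 Neurons, inside the free allocation
const productionCost = estimateDailyCostUsd(1_000_000, 2);  // 2,000,000 Neurons

console.log(prototypeCost.toFixed(2));  // prints "0.00"
console.log(productionCost.toFixed(2)); // prints "21.89"
```

Multiplying the production figure by 30 gives a monthly number to set against the cost of a rented GPU instance, which is the comparison that matters at high volume.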
Core Features
Serverless GPU Execution
Pay only for the inference you run — no idle GPU rental, no instance management, no cold-start penalty. Cloudflare manages the GPU pool and routes requests to the nearest available compute. Models warm in milliseconds rather than minutes.
Global Edge Network (300+ locations)
Models run in Cloudflare's global edge network, the same infrastructure that serves CDN traffic. End-user latency depends on geographic distance to the nearest Cloudflare data center and is typically under 100ms anywhere in the world. Compare this with centralized AWS or Azure regions, where distant users incur an extra 100-300ms of round-trip latency.
Open-Source Model Catalog
Hosted models include Llama variants (Meta), Mistral, embedding models (BGE, MiniLM), Whisper (speech-to-text), Stable Diffusion XL, image classification (ResNet, EfficientNet), and dozens more. Cloudflare keeps the catalog updated with new open-source releases.
Tight Workers Integration
Call Workers AI from a Workers script with env.AI.run('model-name', { prompt: '...' }) — no API key juggling, no auth headers, no separate SDK. The integration is designed for developers building AI features inside web apps using the Cloudflare developer platform.
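A minimal sketch of such a Worker is shown below. It assumes an AI binding named `AI` has been configured for the Worker, and the model ID is an assumption chosen for illustration; check the model catalog for current names. The `AiBinding` interface is a hand-written stand-in for the types that Cloudflare's tooling would normally provide.

```typescript
// Hand-written stand-in for the AI binding type (an assumption for this sketch).
interface AiBinding {
  run(model: string, inputs: Record<string, unknown>): Promise<unknown>;
}

interface Env {
  AI: AiBinding; // assumes a binding named "AI" in the Worker's configuration
}

// Assumed model ID; substitute any text-generation model from the catalog.
const MODEL = "@cf/meta/llama-3.1-8b-instruct";

const worker = {
  async fetch(request: Request, env: Env): Promise<Response> {
    // Read the prompt from a hypothetical ?q= query parameter.
    const prompt =
      new URL(request.url).searchParams.get("q") ?? "Say hello in one sentence.";

    // The binding authenticates the call; no API key or auth header appears here.
    const result = await env.AI.run(MODEL, { prompt });

    return new Response(JSON.stringify(result), {
      headers: { "content-type": "application/json" },
    });
  },
};

export default worker;
```

Keeping the handler on a named object makes it easy to exercise locally with a fake `AI` binding before deploying.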
AI Gateway
AI Gateway sits in front of any model endpoint (Workers AI, OpenAI, Anthropic) and adds caching, rate limiting, retries, analytics, and a unified observability surface. It is useful for production AI applications that mix multiple model providers.
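One way to route a Workers AI call through a gateway is to pass gateway options as a third argument to `run`. The sketch below assumes that option shape (`gateway.id`, `gateway.cacheTtl`), a gateway named `my-gateway` that you have already created, and a sentiment model ID picked for illustration; all three should be checked against the current documentation.

```typescript
// Assumed shape of the gateway options; verify against the AI Gateway docs.
interface GatewayOptions {
  id: string;          // hypothetical gateway name
  skipCache?: boolean;
  cacheTtl?: number;   // seconds to keep a cached response
}

// Hand-written stand-in for the AI binding type.
interface AiBinding {
  run(
    model: string,
    inputs: Record<string, unknown>,
    options?: { gateway?: GatewayOptions },
  ): Promise<unknown>;
}

// Assumed model ID for a small sentiment classifier.
const SENTIMENT_MODEL = "@cf/huggingface/distilbert-sst-2-int8";

/** Classify text, letting the gateway cache identical requests for an hour. */
async function classifyViaGateway(ai: AiBinding, text: string): Promise<unknown> {
  return ai.run(
    SENTIMENT_MODEL,
    { text },
    { gateway: { id: "my-gateway", cacheTtl: 3600 } },
  );
}
```

Caching suits deterministic tasks such as classification, where the same input should produce the same answer; for open-ended chat it is usually better left off.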
Vectorize (Vector Database)
Cloudflare's vector database for embeddings — pairs naturally with Workers AI embedding models for RAG (Retrieval-Augmented Generation) workloads. Vectors and inference run in the same network, minimizing round-trips.
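The retrieval flow can be sketched as three steps: embed the question, query the index, and pass the retrieved text to a chat model. The binding names (`AI`, `VECTORIZE`), both model IDs, the `returnMetadata: "all"` option, and the `text` metadata field are assumptions for this sketch; the interfaces are hand-written stand-ins rather than Cloudflare's own types.

```typescript
// Hand-written stand-ins for the binding types (assumptions for this sketch).
interface AiBinding {
  run(model: string, inputs: Record<string, unknown>): Promise<unknown>;
}
interface VectorMatch {
  id: string;
  score: number;
  metadata?: Record<string, unknown>;
}
interface VectorIndex {
  query(
    vector: number[],
    options: { topK: number; returnMetadata?: "none" | "indexed" | "all" },
  ): Promise<{ matches: VectorMatch[] }>;
}
interface Env {
  AI: AiBinding;
  VECTORIZE: VectorIndex;
}

// Assumed model IDs; check the catalog for current names.
const EMBEDDING_MODEL = "@cf/baai/bge-base-en-v1.5";
const CHAT_MODEL = "@cf/meta/llama-3.1-8b-instruct";

async function answerWithContext(env: Env, question: string): Promise<unknown> {
  // 1. Embed the question (assumes the model returns { data: number[][] }).
  const embedding = (await env.AI.run(EMBEDDING_MODEL, { text: [question] })) as {
    data: number[][];
  };

  // 2. Fetch the nearest stored vectors along with their metadata.
  const { matches } = await env.VECTORIZE.query(embedding.data[0], {
    topK: 3,
    returnMetadata: "all",
  });

  // 3. Build context from a hypothetical `text` field saved at insert time.
  const context = matches
    .map((m) => String(m.metadata?.text ?? ""))
    .filter((t) => t.length > 0)
    .join("\n");

  // 4. Ask the chat model to answer from the retrieved context only.
  return env.AI.run(CHAT_MODEL, {
    prompt: `Answer using only this context:\n${context}\n\nQuestion: ${question}`,
  });
}
```

Both the embedding call and the index query go through bindings in the same Worker, so the only external round-trip is the user's original request.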
Stripe Projects — Agentic Onboarding (May 2026 Open Beta)
In May 2026, Cloudflare and Stripe rolled out an open beta of a Stripe Projects integration that gives AI agents end-to-end self-service onboarding to the Cloudflare developer platform: an agent can create a Cloudflare account, buy a domain, and deploy a Worker without a human clicking through forms. Stripe acts as the orchestrator. It handles KYC, issues a scoped payment token to the agent rather than a real card number, and enforces a default spending cap of $100 per month per provider. A human still grants the initial permission and accepts the terms of service, but for short-running agentic deployments this is the cleanest "agent buys its own stack" pattern any major cloud has shipped to date. The open beta is gated behind Stripe Projects, and standard Workers Paid pricing applies once the agent's traffic exceeds the free limits.
Strengths
- Generous free tier: 10,000 Neurons per day with no credit card — meaningful for prototyping
- Global edge inference: Sub-100ms latency anywhere in the world for most models
- Predictable pricing: Neuron-based metering converts cleanly to per-token or per-image rates
- Open-source model focus: Llama, Mistral, and other open models — no licensing surprises
- Cloudflare ecosystem fit: Tight integration with Workers, R2, D1, Vectorize, AI Gateway — single platform for full-stack AI apps
- No GPU operations: Cloudflare manages the GPU pool; you write code
Limitations & Considerations
- No frontier closed models: GPT-5, Claude Opus, Gemini Ultra are not on Workers AI — for those, use AI Gateway in front of OpenAI/Anthropic/Google APIs
- Smaller open-source models: Catalog focuses on production-ready open models, not the absolute largest variants — fine for most workloads, limiting for some
- Cost at extreme scale: For very high-volume inference (millions of requests per minute), self-hosted GPU economics may beat Neuron pricing — model your specific workload
- Lock-in risk: AI Gateway, Workers AI, and Vectorize are tightly coupled, so moving off Cloudflare means rewriting the AI stack
- Cloudflare dependency: Outages on Cloudflare's edge network affect Workers AI alongside the rest of the developer platform — use AI Gateway redundancy for production reliability
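The redundancy point in the last item can also be handled in application code. The sketch below is a generic try-in-order fallback wrapper, not AI Gateway's own fallback feature; the `Provider` type and `runWithFallback` function are hypothetical names, and each provider's `run` would wrap whichever client you actually use.

```typescript
// Hypothetical provider abstraction: any function that turns a prompt into text.
type Provider = {
  name: string;
  run: (prompt: string) => Promise<string>;
};

/** Try each provider in order and return the first successful result. */
async function runWithFallback(
  providers: Provider[],
  prompt: string,
): Promise<{ provider: string; output: string }> {
  const errors: string[] = [];
  for (const p of providers) {
    try {
      return { provider: p.name, output: await p.run(prompt) };
    } catch (err) {
      // Record the failure and move on to the next provider.
      errors.push(`${p.name}: ${err instanceof Error ? err.message : String(err)}`);
    }
  }
  throw new Error(`All providers failed: ${errors.join("; ")}`);
}
```

Ordering the list with Workers AI first and an external API second keeps the cheap edge path as the default while still giving requests somewhere to go during an outage.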
Best Use Cases
| Use Case | Why Workers AI Fits | Caveat |
|---|---|---|
| Real-time AI features in web apps | Sub-100ms global latency + Workers integration | Smaller open-source models, not flagship closed models |
| Embedding generation + RAG | Workers AI + Vectorize + AI Gateway in one platform | For frontier-quality embeddings, OpenAI/Voyage may rank higher |
| Speech-to-text at scale | Whisper hosted on edge, low usage-based pricing | Compare against Deepgram or AssemblyAI for production accuracy |
| Image classification / moderation | EfficientNet, ResNet, vision models on edge GPUs | For custom-trained models, host on dedicated GPU clouds |
| Prototyping AI features | Free tier with no credit card lowers experimentation cost | Move to Paid plan for production traffic |
When to choose alternatives:
- Frontier model quality needed → OpenAI API, Anthropic API, or Google Gemini API (use AI Gateway in front of them)
- Massive training workloads → dedicated GPU cloud (Lambda Cloud, CoreWeave, hyperscaler AI services)
- Custom-trained model hosting → Modal, Replicate, or AWS SageMaker for arbitrary container deployment
- Largest-scale production inference → self-hosted GPU on Lambda, CoreWeave, or hyperscaler bare metal
Key Takeaways
- Cloudflare Workers AI is serverless GPU inference for open-source AI models, running at 300+ global edge locations with sub-100ms typical latency
- Pricing is Neuron-based: 10,000 Neurons per day free, $0.011 per 1,000 Neurons after, with per-model rates also published in tokens and images
- Best fit for production AI features inside web/mobile apps where latency, cost predictability, and tight Workers integration matter more than absolute model frontier quality
- Pair with AI Gateway to mix Workers AI with OpenAI / Anthropic / Google for hybrid open + closed model architectures
- For frontier-quality flagship models, training workloads, or extreme-scale production inference, alternatives (OpenAI API, Lambda Cloud, dedicated GPU hosting) often serve better
- The May 2026 Cloudflare + Stripe Projects beta lets AI agents self-serve account creation, domain purchase, and Worker deployment, with Stripe orchestrating KYC, scoped payment tokens, and a default spending cap of $100 per month per provider