Name: Cloudflare Workers AI
Author: Cloudflare

Learning Objectives

Understand serverless edge AI inference and where Workers AI fits in the AI compute stack
Identify the Neuron-based pricing model and how it scales with usage
Evaluate when Workers AI beats AWS Bedrock, OpenAI API, or self-hosted GPU inference

What Is Cloudflare Workers AI?

Workers AI is Cloudflare's serverless GPU inference platform — running open-source AI models at hundreds of global edge locations with no GPU rental, no cold-start delays, and per-request pricing. It is the AI inference layer of the Cloudflare developer platform, alongside Workers (compute), R2 (storage), D1 (SQL), KV (key-value), and Durable Objects.

Where AWS Bedrock and Azure OpenAI mainly sell access to closed flagship models from a limited set of cloud regions, Workers AI runs open-source models (Llama, Mistral, embeddings, image classification, speech-to-text) on Cloudflare-managed GPUs at 300+ global edge locations. The result: network latency depends on how close the user is to a Cloudflare data center (typically under 100 ms anywhere in the world) rather than how close they are to a single cloud region. Model execution time is added on top of that network round trip.

💡 Key Concept

Why edge AI inference: Most AI workloads don't need a frontier flagship model — they need fast, cheap, predictable inference on smaller models (sentiment, classification, embeddings, simple chat, transcription). Edge inference moves the compute close to the user, dropping latency from hundreds of milliseconds (centralized region) to tens of milliseconds (nearby edge). For real-time AI features inside a web or mobile app, this is the difference between snappy and sluggish.

✅ Tip

Visit Workers AI at developers.cloudflare.com/workers-ai. The free tier includes 10,000 Neurons per day; the Workers Paid plan unlocks higher limits.

Pricing

Workers AI pricing is Neuron-based — Neurons measure GPU compute consumed per request, with each model type having its own Neuron rate. Per-model prices are also published in tokens or images for easier estimation.

Plan	Price	Features
Free	10,000 Neurons per day	All available models; rate-limited by daily Neuron cap; no credit card required
Workers Paid	$5/month base + $0.011 per 1,000 Neurons over free allocation	Same model catalog as Free; higher concurrency limits; single billing for all Cloudflare developer products
Per-Model Token Pricing	Examples: $0.003 per million input tokens (small models)	Up to $2.51 per million images for vision models; granular per-model rates published in docs; predictable cost scaling
Workers Enterprise	Custom contract	Higher rate limits; SLA guarantees; dedicated support

The free tier is genuinely usable for prototyping and small projects — 10,000 Neurons per day covers thousands of small-model inferences. Cost discipline matters at scale: for high-volume production traffic, compare Neuron pricing against self-hosted GPU economics.
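The Neuron arithmetic above can be sketched as a small estimator. This is a rough illustration, not an official calculator: the function name is hypothetical, the rates are the ones quoted in this lesson, and it assumes identical usage every day with the free allocation deducted daily.

```typescript
// Rates as quoted in the pricing table of this lesson; verify against the
// current Cloudflare pricing page before budgeting.
const FREE_NEURONS_PER_DAY = 10000;
const USD_PER_1000_NEURONS = 0.011;
const PAID_PLAN_BASE_USD = 5;

// Hypothetical helper: estimates a monthly Workers Paid bill, assuming the
// same Neuron usage every day and a 30-day month unless told otherwise.
function estimateMonthlyCostUsd(neuronsPerDay: number, days: number = 30): number {
  // Only Neurons above the daily free allocation are billed.
  const billablePerDay = Math.max(0, neuronsPerDay - FREE_NEURONS_PER_DAY);
  const usageUsd = ((billablePerDay * days) / 1000) * USD_PER_1000_NEURONS;
  // Round to whole cents to hide floating-point noise.
  return Math.round((PAID_PLAN_BASE_USD + usageUsd) * 100) / 100;
}

console.log(estimateMonthlyCostUsd(8000));  // usage stays inside the free allocation
console.log(estimateMonthlyCostUsd(50000)); // 40,000 billable Neurons per day
```

At 50,000 Neurons per day, 40,000 are billable each day, or 1,200,000 over 30 days. That is 1,200 units of 1,000 Neurons at $0.011, which comes to $13.20 of usage plus the $5 base, or $18.20 for the month.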

Core Features

Serverless GPU Execution

Pay only for the inference you run — no idle GPU rental, no instance management, no cold-start penalty. Cloudflare manages the GPU pool and routes requests to the nearest available compute. Models warm in milliseconds rather than minutes.

Global Edge Network (300+ locations)

Models run in Cloudflare's global edge network, the same infrastructure that serves CDN traffic. End-user network latency depends on geographic distance to the nearest Cloudflare data center and is typically under 100 ms anywhere in the world. Compare that with centralized AWS or Azure regions, where users far from the region see an additional 100-300 ms of round-trip latency.

Open-Source Model Catalog

Hosted models include Llama variants (Meta), Mistral, embedding models (BGE, MiniLM), Whisper (speech-to-text), Stable Diffusion XL, image classification (ResNet, EfficientNet), and dozens more. Cloudflare keeps the catalog updated with new open-source releases.

Tight Workers Integration

Call Workers AI from a Workers script with env.AI.run('model-name', { prompt: '...' }) — no API key juggling, no auth headers, no separate SDK. The integration is designed for developers building AI features inside web apps using the Cloudflare developer platform.
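A minimal sketch of that call inside a module-style Worker might look like the following. Several pieces are assumptions rather than details from this lesson: the binding is named AI in the Wrangler configuration, the model id is one example from the catalog, the response carries a response string field, and the local interfaces stand in for the types Cloudflare generates in a real project.

```typescript
// Minimal local typing for the AI binding (assumption: real projects would
// use Cloudflare's generated Worker types instead of this hand-written one).
interface AiBinding {
  run(model: string, input: { prompt: string }): Promise<{ response?: string }>;
}

interface Env {
  AI: AiBinding;
}

// Assumed model id; substitute any text-generation model from the catalog.
const MODEL = "@cf/meta/llama-3.1-8b-instruct";

// Hypothetical helper: sends one prompt and returns the generated text,
// or an empty string if the model returned no text.
async function answer(env: Env, question: string): Promise<string> {
  const result = await env.AI.run(MODEL, { prompt: question });
  return result.response ?? "";
}

// Module-style Worker entry point. In a deployed Worker this object would be
// the default export of the script.
const worker = {
  async fetch(request: Request, env: Env): Promise<Response> {
    // Read the question from the ?q= query parameter, with a fallback prompt.
    const question =
      new URL(request.url).searchParams.get("q") ?? "What is edge inference?";
    const body = JSON.stringify({ answer: await answer(env, question) });
    return new Response(body, {
      headers: { "content-type": "application/json" },
    });
  },
};
```

Keeping the model call in its own function makes it easy to exercise with a stub binding in unit tests, without deploying to Cloudflare.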

AI Gateway

AI Gateway sits in front of any model endpoint (Workers AI, OpenAI, Anthropic) and adds caching, rate limiting, retries, analytics, and a unified observability surface. It is useful for production AI applications that mix multiple model providers.
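As a rough illustration of how requests are routed through the gateway, the sketch below builds a per-provider base URL. The URL layout (gateway.ai.cloudflare.com/v1/, then account id, gateway id, and provider) is recalled from Cloudflare's documentation and should be verified there; the helper name, the account id, and the gateway id are placeholders.

```typescript
// Provider slugs assumed for this sketch; the gateway supports others too.
type Provider = "workers-ai" | "openai" | "anthropic";

// Hypothetical helper: builds the gateway base URL for one provider.
// Assumed layout: https://gateway.ai.cloudflare.com/v1/{account}/{gateway}/{provider}
function gatewayBaseUrl(accountId: string, gatewayId: string, provider: Provider): string {
  const account = encodeURIComponent(accountId);
  const gateway = encodeURIComponent(gatewayId);
  return `https://gateway.ai.cloudflare.com/v1/${account}/${gateway}/${provider}`;
}

const ACCOUNT_ID = "YOUR_ACCOUNT_ID"; // placeholder, not a real account
const GATEWAY_ID = "my-gateway";      // placeholder gateway name

// The provider's usual API path is appended after the gateway prefix, so an
// existing client only needs its base URL changed to pass through the gateway.
const openAiChatUrl = `${gatewayBaseUrl(ACCOUNT_ID, GATEWAY_ID, "openai")}/chat/completions`;
console.log(openAiChatUrl);
```

Because only the base URL changes, switching a client between providers or adding the gateway later does not require rewriting request bodies.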

Vectorize (Vector Database)

Cloudflare's vector database for embeddings — pairs naturally with Workers AI embedding models for RAG (Retrieval-Augmented Generation) workloads. Vectors and inference run in the same network, minimizing round-trips.
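The retrieval half of a RAG pipeline can be sketched as: embed the question, then ask the vector index for its nearest neighbours. The interfaces below are simplified local stand-ins, and the embedding model id, the data field of the embedding response, and the query(vector, { topK }) shape are assumptions to check against the Workers AI and Vectorize documentation.

```typescript
// Simplified stand-in for the Workers AI binding when used for embeddings.
interface EmbeddingAi {
  run(model: string, input: { text: string[] }): Promise<{ data: number[][] }>;
}

interface VectorMatch {
  id: string;
  score: number;
}

// Simplified stand-in for a Vectorize index binding.
interface VectorIndex {
  query(vector: number[], options: { topK: number }): Promise<{ matches: VectorMatch[] }>;
}

// Assumed embedding model id from the catalog (BGE family).
const EMBEDDING_MODEL = "@cf/baai/bge-base-en-v1.5";

// Hypothetical helper: returns the ids of the stored chunks closest to the
// question, in the order the index ranked them.
async function retrieve(
  ai: EmbeddingAi,
  index: VectorIndex,
  question: string,
  topK: number = 3,
): Promise<string[]> {
  // Step 1: turn the question into a vector with the embedding model.
  const embedding = await ai.run(EMBEDDING_MODEL, { text: [question] });
  const vector = embedding.data[0];
  if (vector === undefined) {
    throw new Error("embedding model returned no vector");
  }
  // Step 2: nearest-neighbour search in the vector index.
  const result = await index.query(vector, { topK });
  return result.matches.map((match) => match.id);
}
```

The returned ids would then be used to load the matching text chunks (for example from KV, D1, or R2) and prepend them to the prompt sent to a text-generation model.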

Stripe Projects — Agentic Onboarding (May 2026 Open Beta)

In May 2026, Cloudflare and Stripe rolled out an open beta of a Stripe Projects integration that gives AI agents end-to-end self-service onboarding to the Cloudflare developer platform: an agent can create a Cloudflare account, buy a domain, and deploy a Worker without a human clicking through forms. Stripe acts as the orchestrator: it handles KYC, issues a scoped payment token to the agent rather than a real card number, and enforces a default spending cap of $100 per month per provider. A human still grants the initial permission and accepts the terms of service. For short-running agentic deployments, this is arguably the cleanest "agent buys its own stack" pattern a major cloud has shipped to date. The open beta is gated behind Stripe Projects, and standard Workers Paid pricing applies once the agent's traffic hits paid limits.

Strengths

Generous free tier: 10,000 Neurons per day with no credit card — meaningful for prototyping
Global edge inference: Sub-100ms latency anywhere in the world for most models
Predictable pricing: Neuron-based metering converts cleanly to per-token or per-image rates
Open-source model focus: Llama, Mistral, and other open models — no licensing surprises
Cloudflare ecosystem fit: Tight integration with Workers, R2, D1, Vectorize, AI Gateway — single platform for full-stack AI apps
No GPU operations: Cloudflare manages the GPU pool; you write code

Limitations & Considerations

No frontier closed models: GPT-5, Claude Opus, Gemini Ultra are not on Workers AI — for those, use AI Gateway in front of OpenAI/Anthropic/Google APIs
Smaller open-source models: Catalog focuses on production-ready open models, not the absolute largest variants — fine for most workloads, limiting for some
Cost at extreme scale: For very high-volume inference (millions of requests per minute), self-hosted GPU economics may beat Neuron pricing — model your specific workload
Lock-in risk: AI Gateway, Workers AI, and Vectorize are tightly coupled, so moving off Cloudflare means rewriting the AI stack
Cloudflare dependency: Outages on Cloudflare's edge network affect Workers AI alongside the rest of the developer platform — use AI Gateway redundancy for production reliability

Best Use Cases

Use Case	Why Workers AI Fits	Caveat
Real-time AI features in web apps	Sub-100ms global latency + Workers integration	Smaller open-source models, not flagship closed models
Embedding generation + RAG	Workers AI + Vectorize + AI Gateway in one platform	For frontier-quality embeddings, OpenAI/Voyage may rank higher
Speech-to-text at scale	Whisper hosted on the edge, low per-audio-minute pricing	Compare against Deepgram or AssemblyAI for production accuracy
Image classification / moderation	EfficientNet, ResNet, vision models on edge GPUs	For custom-trained models, host on dedicated GPU clouds
Prototyping AI features	Free tier with no credit card lowers experimentation cost	Move to Paid plan for production traffic

When to choose alternatives:

Frontier model quality needed → OpenAI API, Anthropic API, or Google Gemini API (use AI Gateway in front of them)
Massive training workloads → dedicated GPU cloud (Lambda Cloud, CoreWeave, hyperscaler AI services)
Custom-trained model hosting → Modal, Replicate, or AWS SageMaker for arbitrary container deployment
Largest-scale production inference → self-hosted GPU on Lambda, CoreWeave, or hyperscaler bare metal

Key Takeaways

Cloudflare Workers AI is serverless GPU inference for open-source AI models, running at 300+ global edge locations with sub-100ms typical latency
Pricing is Neuron-based: 10,000 Neurons per day free, $0.011 per 1,000 Neurons after, with per-model rates also published in tokens and images
Best fit for production AI features inside web/mobile apps where latency, cost predictability, and tight Workers integration matter more than absolute model frontier quality
Pair with AI Gateway to mix Workers AI with OpenAI / Anthropic / Google for hybrid open + closed model architectures
For frontier-quality flagship models, training workloads, or extreme-scale production inference, alternatives (OpenAI API, Lambda Cloud, dedicated GPU hosting) often serve better
The May 2026 Cloudflare + Stripe Projects beta lets AI agents self-serve account creation, domain purchase, and Worker deployment, with Stripe orchestrating KYC, scoped payment tokens, and a default spending cap of $100 per month per provider
