Cloudflare Workers AI Not Working? Fix Binding, Model & Rate Limit Errors
Troubleshoot Cloudflare Workers AI errors — AI binding not configured in wrangler.toml, model not found or unavailable, neurons/day rate limit exceeded, local dev with --remote flag, and streaming with SSE.
Common errors and fixes
AI binding not configured in wrangler.toml
The AI binding must be declared in wrangler.toml before it can be used in your Worker. Without it, env.AI will be undefined and you'll get a TypeError.
# wrangler.toml — add this block
name = "my-worker"
main = "src/index.ts"
compatibility_date = "2024-01-01"
[ai]
binding = "AI"
Then in your Worker TypeScript:
export interface Env {
AI: Ai; // type from @cloudflare/workers-types
}
export default {
async fetch(request: Request, env: Env): Promise<Response> {
const response = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
prompt: 'What is the capital of France?',
});
return Response.json(response);
},
};
Run npx wrangler types to regenerate type definitions after adding the AI binding.
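Because a missing binding only surfaces as a TypeError at the first env.AI.run call, it can help to fail fast with a message that names the fix. The following is a minimal sketch rather than anything from Cloudflare's API: requireAiBinding is a hypothetical helper, and the default binding name "AI" is taken from the wrangler.toml block above.

```typescript
// Hypothetical helper (not a Cloudflare API): look up the AI binding on env
// and throw a descriptive error instead of a bare TypeError when it is missing.
function requireAiBinding<T>(env: object, name: string = 'AI'): T {
  const binding = (env as Record<string, unknown>)[name];
  if (binding === undefined || binding === null) {
    throw new Error(
      `AI binding "${name}" is not configured. ` +
        `Add an [ai] block with binding = "${name}" to wrangler.toml.`,
    );
  }
  return binding as T;
}
```

Inside fetch, this would be called as const ai = requireAiBinding<Ai>(env); before the first ai.run call, so a misconfigured deployment reports the cause directly.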
Local dev not working — must use --remote flag
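Workers AI inference runs on Cloudflare's GPU network, not on your machine, so there is no local model to call. If env.AI is unavailable under plain wrangler dev (behavior depends on your Wrangler version), start the dev server with --remote.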
# ❌ Won't work — AI binding unavailable locally
wrangler dev
# ✅ Use --remote to run inference on Cloudflare's network
wrangler dev --remote
--remote requires a Cloudflare account and active login (wrangler login). Your account is billed for usage during remote dev.
Wrong model name / model not found
Correct model ID format: @cf/[author]/[model-name]. Common mistake: using just llama-3.1-8b without the @cf/meta/ prefix.
- Text generation: @cf/meta/llama-3.1-8b-instruct (free), @cf/meta/llama-3.3-70b-instruct-fp8-fast (paid)
- Embeddings: @cf/baai/bge-base-en-v1.5, @cf/baai/bge-large-en-v1.5
- Image generation: @cf/black-forest-labs/flux-1-schnell
- Speech to text: @cf/openai/whisper
- Translation: @cf/meta/m2m100-1.2b
Browse the full catalog at developers.cloudflare.com/workers-ai/models/. Some models (like Llama 70B) require the Workers Paid plan.
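A malformed ID can be caught before it reaches env.AI.run. The sketch below checks only the @cf/[author]/[model-name] shape described in this section; isWellFormedModelId is a hypothetical helper, and it cannot tell whether a well-formed ID actually exists in the catalog.

```typescript
// Matches the "@cf/[author]/[model-name]" shape described in this section.
// Format check only: a well-formed ID may still name a model that does not exist.
const MODEL_ID_PATTERN = /^@cf\/[a-z0-9][a-z0-9._-]*\/[a-z0-9][a-z0-9._-]*$/i;

// Hypothetical helper: true when the ID has the prefix, an author, and a model name.
function isWellFormedModelId(id: string): boolean {
  return MODEL_ID_PATTERN.test(id);
}
```

For example, isWellFormedModelId('llama-3.1-8b') is false because the @cf/meta/ prefix is missing. If the catalog lists a model under a different prefix, the pattern would need to be widened.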
Rate limit exceeded (neurons/day)
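The free tier allows 10,000 neurons/day. How many neurons a request consumes depends on the model and on both input and output token counts, so check the Workers AI pricing page for per-model rates. Add KV caching to reduce AI calls for repeated prompts: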
// Add KV caching to reduce AI calls
// (assumes Env also declares KV: KVNamespace, bound in wrangler.toml)
export default {
async fetch(request: Request, env: Env): Promise<Response> {
const { prompt } = await request.json() as { prompt: string };
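// Key on a SHA-256 hash of the full prompt so distinct prompts never share an entry
const digest = await crypto.subtle.digest('SHA-256', new TextEncoder().encode(prompt));
const cacheKey = `ai:${[...new Uint8Array(digest)].map((b) => b.toString(16).padStart(2, '0')).join('')}`;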
// Check cache first
const cached = await env.KV.get(cacheKey);
if (cached) return Response.json(JSON.parse(cached));
// Run inference
const result = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', { prompt });
// Cache for 1 hour
await env.KV.put(cacheKey, JSON.stringify(result), { expirationTtl: 3600 });
return Response.json(result);
},
};
- Use smaller models: 8B models use fewer neurons per request than 70B models.
- Upgrade plan: Workers Paid ($5/month) unlocks much higher neuron limits.
- Monitor usage: Cloudflare Dashboard → Workers & Pages → Workers AI → Usage tab.
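Caching only saves neurons when repeated prompts map to the same key, and prompts that differ only in whitespace or letter case would otherwise miss the cache. One possible approach, sketched here with a hypothetical normalizePrompt helper, is to canonicalize the prompt before deriving cacheKey. Lowercasing is an assumption that suits simple Q&A prompts; skip it when case changes the expected answer.

```typescript
// Hypothetical helper: canonicalize a prompt before deriving a cache key, so
// trivially different inputs ("Hello  world " vs "hello world") share an entry.
function normalizePrompt(prompt: string): string {
  return prompt
    .normalize('NFC') // unify equivalent Unicode encodings
    .trim() // drop leading and trailing whitespace
    .replace(/\s+/g, ' ') // collapse internal whitespace runs to one space
    .toLowerCase(); // assumption: case does not change the desired answer
}
```

The normalized string would be used only for the cache key; the original prompt can still be sent to the model unchanged.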
Streaming with SSE
Call the model with stream: true and return the response as a ReadableStream with Content-Type: text/event-stream:
export default {
async fetch(request: Request, env: Env): Promise<Response> {
const response = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
prompt: 'Write a short story',
stream: true, // ← enable streaming
}) as ReadableStream;
return new Response(response, {
headers: {
'Content-Type': 'text/event-stream',
'Cache-Control': 'no-cache',
},
});
},
};
- Client-side consumption: use EventSource or fetch with a ReadableStream reader to consume the SSE stream.
- Model support: not all models support streaming; check the model's documentation page before enabling stream: true.
- Buffering check: if tokens aren't appearing in the browser, verify that Content-Type is text/event-stream and the response body is not being buffered by a middleware layer.
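When consuming the stream with fetch, the client has to split the incoming text into SSE events itself. The sketch below shows one way to do that. It assumes each event is a line of the form data: {"response":"..."}, that the stream ends with data: [DONE], and that lines end with \n. These are assumptions about the payload, so verify them against the model's documentation. extractTokens is a hypothetical helper name.

```typescript
// Hypothetical helper: pull generated text out of buffered SSE data.
// Assumes events look like `data: {"response":"..."}` and that the stream
// ends with `data: [DONE]`; check the model's docs for the actual payload.
function extractTokens(buffer: string): { tokens: string[]; rest: string; done: boolean } {
  const tokens: string[] = [];
  let done = false;
  // Events are separated by a blank line; the last piece may be incomplete.
  const events = buffer.split('\n\n');
  const rest = events.pop() ?? '';
  for (const event of events) {
    for (const line of event.split('\n')) {
      if (!line.startsWith('data:')) continue; // skip comments and other fields
      const payload = line.slice('data:'.length).trim();
      if (payload === '[DONE]') {
        done = true;
        continue;
      }
      try {
        const parsed = JSON.parse(payload) as { response?: unknown };
        if (typeof parsed.response === 'string') tokens.push(parsed.response);
      } catch {
        // Ignore payloads that are not valid JSON.
      }
    }
  }
  return { tokens, rest, done };
}
```

A reader loop would append each decoded chunk to a buffer, call extractTokens, render the returned tokens, and carry rest over as the start of the next buffer, so an event split across two chunks is not lost.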
FAQ
Cloudflare Workers AI vs OpenAI API — when to use each?
Workers AI runs on Cloudflare's network, co-located with your Worker, so inference happens alongside your edge logic with low latency. OpenAI offers more capable models and a wider selection. Use Workers AI when you want no additional infrastructure, free-tier inference, and data that stays within your Cloudflare account. Use OpenAI when you need the most capable models (GPT-4o, o1).
Can I use Workers AI for embeddings + vector search?
Yes. Use @cf/baai/bge-base-en-v1.5 for embeddings and Vectorize (Cloudflare's vector DB) for storage and search. This combination stays entirely within Cloudflare's network with no external API calls.
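The flow can be sketched as two steps: embed and store documents, then embed a query and look up its nearest neighbors. The code below is an illustration under stated assumptions rather than a definitive implementation. EmbeddingRunner and VectorStore are minimal structural types written for this example, modeled on an embedding response with a data array of vectors and an index exposing upsert and query with topK. In a real Worker you would use the Ai and VectorizeIndex types from @cloudflare/workers-types and confirm the exact response shapes in the docs.

```typescript
// Minimal structural types for the two bindings this sketch touches.
// They are assumptions made for illustration; in a real Worker, use the
// Ai and VectorizeIndex types from @cloudflare/workers-types instead.
interface EmbeddingRunner {
  run(model: string, input: { text: string[] }): Promise<{ data: number[][] }>;
}
interface VectorStore {
  upsert(vectors: { id: string; values: number[] }[]): Promise<unknown>;
  query(
    vector: number[],
    options: { topK: number },
  ): Promise<{ matches: { id: string; score: number }[] }>;
}

const EMBEDDING_MODEL = '@cf/baai/bge-base-en-v1.5';

// Embed each document in one batch and store the vector under the document's id.
async function indexDocuments(
  ai: EmbeddingRunner,
  index: VectorStore,
  docs: { id: string; text: string }[],
): Promise<void> {
  const { data } = await ai.run(EMBEDDING_MODEL, { text: docs.map((d) => d.text) });
  await index.upsert(docs.map((d, i) => ({ id: d.id, values: data[i] })));
}

// Embed the query and return the ids of the closest stored vectors.
async function search(
  ai: EmbeddingRunner,
  index: VectorStore,
  query: string,
  topK: number = 3,
): Promise<string[]> {
  const { data } = await ai.run(EMBEDDING_MODEL, { text: [query] });
  const { matches } = await index.query(data[0], { topK });
  return matches.map((m) => m.id);
}
```

In a Worker this might be called as await indexDocuments(env.AI, env.VECTORIZE, docs), where VECTORIZE is a hypothetical binding name for a Vectorize index declared in wrangler.toml. The real binding types may need a cast to fit these simplified interfaces.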
Workers AI vs Cloudflare AI Gateway — what's the difference?
Workers AI is the inference platform (runs models). AI Gateway is a proxy/caching layer in front of external AI providers (OpenAI, Anthropic, etc.). They solve different problems: Workers AI for on-Cloudflare inference, AI Gateway for managing calls to third-party APIs.