Why is Cloudflare Workers AI not working?

Common causes: (1) the AI binding is missing from wrangler.toml, so add an [ai] block; (2) the model name is wrong or the model is not available on your plan; (3) the free-tier limit of 10,000 neurons/day (changed from 3,000 in 2024) has been reached; (4) local dev is misconfigured: older Wrangler versions need `wrangler dev --remote` for AI inference, while recent versions proxy AI calls to Cloudflare even in local mode; (5) the response is not awaited: AI calls are async and must be awaited.

Cloudflare Workers AI model not found or unavailable?

Model names must match exactly, including the @cf/ prefix and version suffix. Browse available models at developers.cloudflare.com/workers-ai/models/. Some models (like Llama 70B) are only on paid plans. Common correct model IDs: @cf/meta/llama-3.1-8b-instruct, @cf/mistral/mistral-7b-instruct-v0.1, @cf/google/gemma-7b-it, @cf/baai/bge-base-en-v1.5 (for embeddings).

Cloudflare Workers AI rate limit exceeded?

The free tier allows 10,000 neurons/day. Neurons are the billing unit, a measure of GPU compute; the neuron cost per token depends on the model and differs for input and output tokens, so check the Workers AI pricing page for per-model rates. Solutions: (1) cache repeated prompts in a KV store; (2) use smaller models (an 8B model uses fewer neurons per request than a 70B model); (3) upgrade to the Workers Paid plan ($5/month), where usage above the free allocation is billed rather than blocked; (4) monitor usage in Cloudflare dashboard → Workers AI → Usage.

Cloudflare Workers AI streaming (SSE) not working?

Workers AI streams responses as server-sent events (SSE). Call the model with stream: true and return the resulting ReadableStream with the correct headers; to parse the events yourself, EventSourceParserStream from the eventsource-parser package works. If tokens aren't appearing in the browser, check that Content-Type is set to text/event-stream and that nothing buffers the response body.

Cloudflare Workers AI Not Working? Fix Binding, Model & Rate Limit Errors

Q: Cloudflare Workers AI binding not found?

The AI binding must be declared in wrangler.toml before it can be used. Add: [ai] binding = "AI". Then in your Worker: const response = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', { prompt: 'Hello' }). Without the binding declaration, env.AI will be undefined and you'll get a TypeError.

Troubleshoot Cloudflare Workers AI errors — AI binding not configured in wrangler.toml, model not found or unavailable, neurons/day rate limit exceeded, local dev with --remote flag, and streaming with SSE.

Common errors and fixes

AI binding not configured in wrangler.toml

The AI binding must be declared in wrangler.toml before it can be used in your Worker. Without it, env.AI will be undefined and you'll get a TypeError.

# wrangler.toml — add this block
name = "my-worker"
main = "src/index.ts"
compatibility_date = "2024-01-01"

[ai]
binding = "AI"

Then in your Worker TypeScript:

export interface Env {
  AI: Ai; // type from @cloudflare/workers-types
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const response = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
      prompt: 'What is the capital of France?',
    });
    return Response.json(response);
  },
};

Note: run npx wrangler types to regenerate type definitions after adding the AI binding.
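
If the binding might be missing in some environments, a small guard can turn the opaque TypeError into an explicit message. This is a minimal sketch: requireAi is a hypothetical helper, and AiBinding is a pared-down stand-in for the Ai type from @cloudflare/workers-types so the example is self-contained.

```typescript
// Minimal stand-in for the `Ai` type (assumption, not the real type definition).
interface AiBinding {
  run(model: string, inputs: Record<string, unknown>): Promise<unknown>;
}

// Hypothetical helper: returns the binding, or throws an error that names the fix.
function requireAi(env: { AI?: AiBinding }): AiBinding {
  if (!env.AI) {
    throw new Error('AI binding missing: add [ai] binding = "AI" to wrangler.toml');
  }
  return env.AI;
}
```

Inside fetch, calling requireAi(env).run(...) instead of env.AI.run(...) surfaces the configuration problem directly in the error log.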

Local dev not working — must use --remote flag

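Workers AI inference runs on Cloudflare's GPU network, never on your machine. Older Wrangler releases could not reach the AI binding from plain wrangler dev; recent releases proxy AI calls to Cloudflare even in local mode. If env.AI fails locally, update Wrangler or fall back to --remote.

# ❌ May fail on older Wrangler versions (AI binding unavailable locally)
wrangler dev

# ✅ Use --remote to run the Worker and its inference on Cloudflare's network
wrangler dev --remote

Note: either way, inference requires a Cloudflare account and active login (wrangler login), and your account is billed for usage during development.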

Wrong model name / model not found

Correct model ID format: @cf/[author]/[model-name]. Common mistake: using just llama-3.1-8b without the @cf/meta/ prefix.

Text generation: @cf/meta/llama-3.1-8b-instruct (free), @cf/meta/llama-3.3-70b-instruct-fp8-fast (paid)
Embeddings: @cf/baai/bge-base-en-v1.5, @cf/baai/bge-large-en-v1.5
Image generation: @cf/black-forest-labs/flux-1-schnell
Speech to text: @cf/openai/whisper
Translation: @cf/meta/m2m100-1.2b

Browse the full catalog at developers.cloudflare.com/workers-ai/models/. Some models (like Llama 70B) require the Workers Paid plan.
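
A quick format check can catch a missing prefix before the request is sent. The sketch below is an assumption-laden illustration: looksLikeModelId and its regular expression are hypothetical, the allowed characters are a guess based on the IDs listed above, and passing the check does not mean the model exists in the catalog.

```typescript
// Assumed pattern for the "@cf/[author]/[model-name]" form described above.
// The character classes are a guess; the model catalog is the source of truth.
const MODEL_ID_PATTERN = /^@cf\/[a-z0-9-]+\/[a-z0-9._-]+$/i;

// Hypothetical helper: true when the string has the expected three-part shape.
function looksLikeModelId(id: string): boolean {
  return MODEL_ID_PATTERN.test(id);
}
```

For instance, looksLikeModelId('llama-3.1-8b') is false because the @cf/meta/ prefix is missing.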

Rate limit exceeded (neurons/day)

The free tier allows 10,000 neurons/day. The neuron cost per token varies by model and differs for input and output tokens, so check the Workers AI pricing page for exact rates. Add KV caching to reduce AI calls for repeated prompts:

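// Add KV caching to reduce AI calls.
// Assumes Env also declares a KV namespace binding: KV: KVNamespace
export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const { prompt } = await request.json() as { prompt: string };

    // Hash the full prompt: btoa() throws on non-Latin-1 text, and a truncated
    // prefix would give different prompts the same key
    const digest = await crypto.subtle.digest('SHA-256', new TextEncoder().encode(prompt));
    const hash = Array.from(new Uint8Array(digest), (b) => b.toString(16).padStart(2, '0')).join('');
    const cacheKey = `ai:${hash}`;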

    // Check cache first
    const cached = await env.KV.get(cacheKey);
    if (cached) return Response.json(JSON.parse(cached));

    // Run inference
    const result = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', { prompt });

    // Cache for 1 hour
    await env.KV.put(cacheKey, JSON.stringify(result), { expirationTtl: 3600 });
    return Response.json(result);
  },
};

Use smaller models: 8B models use fewer neurons per request than 70B models.
Upgrade plan: on Workers Paid ($5/month), usage above the free allocation is billed rather than blocked.
Monitor usage: Cloudflare Dashboard → Workers & Pages → Workers AI → Usage tab.
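
To judge how far the 10,000 neurons/day allocation stretches, a rough budget calculation helps. This is only a sketch: requestsPerDay is a hypothetical helper, and neuronsPerRequest is a figure you would measure yourself (for example from the Usage tab), not a number published here.

```typescript
const FREE_NEURONS_PER_DAY = 10_000; // free-tier allocation quoted above

// Hypothetical helper. neuronsPerRequest: measured average cost of one request.
// cacheHitRate: fraction of requests served from cache (0 to 1), costing no neurons.
function requestsPerDay(neuronsPerRequest: number, cacheHitRate = 0): number {
  if (neuronsPerRequest <= 0) throw new RangeError('neuronsPerRequest must be positive');
  if (cacheHitRate < 0 || cacheHitRate > 1) throw new RangeError('cacheHitRate must be in [0, 1]');
  const effectiveCost = neuronsPerRequest * (1 - cacheHitRate);
  // With a 100% hit rate no neurons are spent, so the budget never runs out.
  return effectiveCost === 0 ? Infinity : Math.floor(FREE_NEURONS_PER_DAY / effectiveCost);
}
```

At a measured 50 neurons per request, the free allocation covers 200 uncached requests a day, or 400 with a 50% cache hit rate.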

Streaming with SSE

Call the model with stream: true and return the response as a ReadableStream with Content-Type: text/event-stream:

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const response = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
      prompt: 'Write a short story',
      stream: true, // ← enable streaming
    }) as ReadableStream;

    return new Response(response, {
      headers: {
        'Content-Type': 'text/event-stream',
        'Cache-Control': 'no-cache',
      },
    });
  },
};

Client-side consumption: use EventSource or fetch with a ReadableStream reader to consume the SSE stream.
Model support: not all models support streaming — check the model's documentation page before enabling stream: true.
Buffering check: if tokens aren't appearing in the browser, verify that Content-Type is text/event-stream and the response body is not being buffered by a middleware layer.
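
On the client, the SSE text still has to be turned back into tokens. The sketch below assumes each event looks like data: {"response":"..."} and that the stream ends with data: [DONE]; verify that against the actual output of the model you call. extractTokens is a hypothetical helper name.

```typescript
// Assumption: events are `data: {"response":"..."}` and the stream ends with
// `data: [DONE]`. Check the real output of your model before relying on this.
function extractTokens(sseText: string): string[] {
  const tokens: string[] = [];
  for (const line of sseText.split('\n')) {
    if (!line.startsWith('data:')) continue; // skip blank lines and other SSE fields
    const payload = line.slice('data:'.length).trim();
    if (payload === '[DONE]') break; // end-of-stream marker
    try {
      const parsed = JSON.parse(payload) as { response?: unknown };
      if (typeof parsed.response === 'string') tokens.push(parsed.response);
    } catch {
      // Ignore payloads that are not complete JSON (e.g. an event split across chunks)
    }
  }
  return tokens;
}
```

This works on text that contains whole events. When reading from a ReadableStream reader, chunks can end mid-event, so buffer the decoded text and only pass complete events (those terminated by a blank line) to the parser.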

FAQ

Cloudflare Workers AI vs OpenAI API — when to use each?

Workers AI runs on Cloudflare's network alongside your Worker, which keeps latency low when inference is part of your edge logic. OpenAI offers more capable models and a wider selection. Use Workers AI when you want no additional infrastructure, free-tier inference, and data that stays within your Cloudflare account. Use OpenAI when you need the most capable models (GPT-4o, o1).

Can I use Workers AI for embeddings + vector search?

Yes. Use @cf/baai/bge-base-en-v1.5 for embeddings and Vectorize (Cloudflare's vector DB) for storage and search. This combination stays entirely within Cloudflare's network with no external API calls.
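
The embed-then-search flow can be sketched as follows. The two interfaces are minimal stand-ins for the Workers AI and Vectorize bindings, written out so the example is self-contained; indexDocument and search are hypothetical helper names, and in a real Worker you would pass env.AI and your Vectorize binding instead.

```typescript
// Minimal stand-ins for the bindings (assumptions, not the full official types).
interface EmbeddingAi {
  run(model: string, inputs: { text: string[] }): Promise<{ data: number[][] }>;
}
interface VectorIndex {
  upsert(vectors: { id: string; values: number[] }[]): Promise<unknown>;
  query(
    vector: number[],
    options: { topK: number },
  ): Promise<{ matches: { id: string; score: number }[] }>;
}

const EMBEDDING_MODEL = '@cf/baai/bge-base-en-v1.5';

// Embed one document and store its vector under the given id.
async function indexDocument(ai: EmbeddingAi, index: VectorIndex, id: string, text: string): Promise<void> {
  const { data } = await ai.run(EMBEDDING_MODEL, { text: [text] });
  await index.upsert([{ id, values: data[0] }]);
}

// Embed the query with the same model and return the ids of the nearest vectors.
async function search(ai: EmbeddingAi, index: VectorIndex, query: string, topK = 3): Promise<string[]> {
  const { data } = await ai.run(EMBEDDING_MODEL, { text: [query] });
  const result = await index.query(data[0], { topK });
  return result.matches.map((match) => match.id);
}
```

Using the same model for documents and queries matters, because vectors from different embedding models are not comparable.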

Workers AI vs Cloudflare AI Gateway — what's the difference?

Workers AI is the inference platform (runs models). AI Gateway is a proxy/caching layer in front of external AI providers (OpenAI, Anthropic, etc.). They solve different problems: Workers AI for on-Cloudflare inference, AI Gateway for managing calls to third-party APIs.
