Cloudflare Workers AI Not Working? Fix Binding, Model & Rate Limit Errors

Troubleshoot Cloudflare Workers AI errors — AI binding not configured in wrangler.toml, model not found or unavailable, neurons/day rate limit exceeded, local dev with --remote flag, and streaming with SSE.

Common errors and fixes

AI binding not configured in wrangler.toml

The AI binding must be declared in wrangler.toml before your Worker can use it. Without it, env.AI is undefined, and the first call to env.AI.run throws a TypeError.

# wrangler.toml — add this block
name = "my-worker"
main = "src/index.ts"
compatibility_date = "2024-01-01"

[ai]
binding = "AI"

Then in your Worker TypeScript:

export interface Env {
  AI: Ai; // type from @cloudflare/workers-types
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const response = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
      prompt: 'What is the capital of France?',
    });
    return Response.json(response);
  },
};
Note: run npx wrangler types to regenerate type definitions after adding the AI binding.
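Because a missing binding only surfaces as a TypeError at call time, a small guard can turn it into a readable error. The following is a minimal sketch: the helper names hasAiBinding and missingBindingMessage are illustrative assumptions, not part of the Workers AI API.

```typescript
// Hypothetical helper (not a Cloudflare API): returns true when the env
// object carries something that looks like an AI binding with a run() method.
function hasAiBinding(env: { AI?: unknown }): boolean {
  const ai = env.AI as { run?: unknown } | null | undefined;
  return ai != null && typeof ai.run === "function";
}

// Hypothetical helper: builds a readable message for a missing binding.
function missingBindingMessage(bindingName: string): string {
  return (
    `Binding "${bindingName}" is not configured. ` +
    `Add an [ai] block with binding = "${bindingName}" to wrangler.toml and redeploy.`
  );
}
```

Inside fetch, one option is to return new Response(missingBindingMessage("AI"), { status: 500 }) when hasAiBinding(env) is false, so the misconfiguration is visible in the response rather than buried in a stack trace.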

Local dev not working — must use --remote flag

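Workers AI inference runs on Cloudflare's GPU network, not on your machine, so the AI binding needs a connection to Cloudflare even during development. If plain wrangler dev cannot reach the binding in your Wrangler version, run it with --remote.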

# ❌ Won't work — AI binding unavailable locally
wrangler dev

# ✅ Use --remote to run inference on Cloudflare's network
wrangler dev --remote
Note: --remote requires a Cloudflare account and active login (wrangler login). Your account is billed for usage during remote dev.

Wrong model name / model not found

Correct model ID format: @cf/[author]/[model-name]. Common mistake: using just llama-3.1-8b without the @cf/meta/ prefix.

  • Text generation: @cf/meta/llama-3.1-8b-instruct (free), @cf/meta/llama-3.3-70b-instruct-fp8-fast (paid)
  • Embeddings: @cf/baai/bge-base-en-v1.5, @cf/baai/bge-large-en-v1.5
  • Image generation: @cf/black-forest-labs/flux-1-schnell
  • Speech to text: @cf/openai/whisper
  • Translation: @cf/meta/m2m100-1.2b

Browse the full catalog at developers.cloudflare.com/workers-ai/models/. Some models (like Llama 70B) require the Workers Paid plan.
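A quick format check can catch the missing-prefix mistake before the request reaches Cloudflare. This is a sketch under assumptions: the helper names are hypothetical, and the allowed characters in each segment are a guess based on the IDs listed above. It only checks the @cf/[author]/[model-name] shape, so a well-formed ID can still be absent from the catalog.

```typescript
// Matches the @cf/[author]/[model-name] shape described above.
// Assumption: each segment uses letters, digits, dots, underscores, or hyphens.
const CF_MODEL_ID = /^@cf\/[A-Za-z0-9._-]+\/[A-Za-z0-9._-]+$/;

// Hypothetical helper: true when the ID has the expected shape.
function isWellFormedModelId(id: string): boolean {
  return CF_MODEL_ID.test(id);
}

// Hypothetical helper: returns null for a well-formed ID,
// otherwise a short explanation of what looks wrong.
function describeModelIdProblem(id: string): string | null {
  if (isWellFormedModelId(id)) return null;
  if (!id.startsWith("@cf/")) return `"${id}" is missing the @cf/ prefix`;
  return `"${id}" should look like @cf/[author]/[model-name]`;
}
```

For example, describeModelIdProblem("llama-3.1-8b") reports the missing prefix, while describeModelIdProblem("@cf/meta/llama-3.1-8b-instruct") returns null.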

Rate limit exceeded (neurons/day)

The free tier allows 10,000 neurons/day. Neuron cost per token varies by model and covers both input and output tokens, so check the model's pricing before estimating a budget. Add KV caching to reduce AI calls for repeated prompts:

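// Add KV caching to reduce AI calls (assumes a KV namespace bound as KV)
export interface Env {
  AI: Ai;
  KV: KVNamespace;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const { prompt } = await request.json() as { prompt: string };

    // Key on a SHA-256 hash of the full prompt. A truncated btoa() throws on
    // non-Latin-1 text and collides for prompts that share a prefix.
    const digest = await crypto.subtle.digest('SHA-256', new TextEncoder().encode(prompt));
    const hash = [...new Uint8Array(digest)].map((b) => b.toString(16).padStart(2, '0')).join('');
    const cacheKey = `ai:${hash}`;

    // Check cache first
    const cached = await env.KV.get(cacheKey);
    if (cached) return Response.json(JSON.parse(cached));

    // Run inference
    const result = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', { prompt });

    // Cache for 1 hour
    await env.KV.put(cacheKey, JSON.stringify(result), { expirationTtl: 3600 });
    return Response.json(result);
  },
};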
  • Use smaller models: 8B models use fewer neurons per request than 70B models.
  • Upgrade plan: Workers Paid ($5/month) lets you go beyond the free daily allocation, with additional usage billed per neuron.
  • Monitor usage: Cloudflare Dashboard → Workers & Pages → Workers AI → Usage tab.
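Caching only saves neurons when repeated prompts map to the same key. One possible refinement, sketched below, is to normalize the prompt before deriving the cache key. The helper name normalizePrompt is hypothetical, and treating whitespace differences as insignificant is an assumption that may not hold for every use case (for example, prompts containing code).

```typescript
// Hypothetical helper: trims the prompt and collapses runs of whitespace so
// trivially different prompts share one cache entry.
// Case is deliberately preserved because it can change the meaning of a prompt.
function normalizePrompt(prompt: string): string {
  return prompt.trim().replace(/\s+/g, " ");
}
```

Derive the cache key from normalizePrompt(prompt) rather than the raw string, and keep sending the original prompt to the model.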

Streaming with SSE

Call the model with stream: true and return the response as a ReadableStream with Content-Type: text/event-stream:

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const response = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
      prompt: 'Write a short story',
      stream: true, // ← enable streaming
    }) as ReadableStream;

    return new Response(response, {
      headers: {
        'Content-Type': 'text/event-stream',
        'Cache-Control': 'no-cache',
      },
    });
  },
};
  • Client-side consumption: use EventSource (which can only issue GET requests) or fetch with a ReadableStream reader to consume the SSE stream.
  • Model support: not all models support streaming — check the model's documentation page before enabling stream: true.
  • Buffering check: if tokens aren't appearing in the browser, verify that Content-Type is text/event-stream and the response body is not being buffered by a middleware layer.
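On the client, the stream arrives as SSE text that still has to be split into tokens. The sketch below assumes each event is a line of the form data: {"response":"..."} and that the stream ends with data: [DONE]. That wire format is an assumption to verify against your model's actual output, and extractTokens is a hypothetical helper name.

```typescript
// Assumed wire format (verify against what your model actually emits):
//   data: {"response":"Hello"}
//   data: [DONE]
// Hypothetical helper: pulls the text tokens out of a block of SSE text.
function extractTokens(sseText: string): string[] {
  const tokens: string[] = [];
  for (const line of sseText.split("\n")) {
    const trimmed = line.trim();
    if (!trimmed.startsWith("data:")) continue; // skip blank lines and comments
    const payload = trimmed.slice("data:".length).trim();
    if (payload === "" || payload === "[DONE]") continue; // end-of-stream marker
    try {
      const parsed = JSON.parse(payload) as { response?: unknown };
      if (typeof parsed.response === "string") tokens.push(parsed.response);
    } catch {
      // Ignore payloads that are not valid JSON
    }
  }
  return tokens;
}
```

The function expects complete lines. When reading from a ReadableStream reader, network chunks can end mid-line, so keep any trailing partial line in a buffer and prepend it to the next chunk before calling extractTokens.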

FAQ

Cloudflare Workers AI vs OpenAI API — when to use each?

Workers AI runs on Cloudflare's network alongside your Worker, which keeps latency low for inference that happens next to your edge logic. OpenAI offers more capable models and a wider selection. Use Workers AI when you want no additional infrastructure, free-tier inference, and data that stays within your Cloudflare account. Use OpenAI when you need the most capable models (GPT-4o, o1).

Can I use Workers AI for embeddings + vector search?

Yes. Use @cf/baai/bge-base-en-v1.5 for embeddings and Vectorize (Cloudflare's vector DB) for storage and search. This combination stays entirely within Cloudflare's network with no external API calls.
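The flow can be sketched as follows. The minimal interfaces stand in for the generated binding types so the example is self-contained. The binding name VECTORIZE and the helper names are assumptions, and the embedding output is assumed to expose one vector per input text under data.

```typescript
// Minimal structural types for illustration; a real Worker would use the
// generated binding types instead.
interface EmbeddingOutput { data: number[][] }
interface VectorRecord { id: string; values: number[] }
interface SearchEnv {
  AI: { run(model: string, input: { text: string[] }): Promise<EmbeddingOutput> };
  VECTORIZE: { // binding name is an assumption
    upsert(vectors: VectorRecord[]): Promise<unknown>;
    query(vector: number[], options: { topK: number }): Promise<unknown>;
  };
}

const EMBEDDING_MODEL = "@cf/baai/bge-base-en-v1.5";

// Pairs each document ID with its embedding vector.
function toVectorRecords(ids: string[], output: EmbeddingOutput): VectorRecord[] {
  if (ids.length !== output.data.length) {
    throw new Error(`expected ${ids.length} embeddings, got ${output.data.length}`);
  }
  return ids.map((id, i) => ({ id, values: output.data[i] }));
}

// Embeds the documents in one call and stores the vectors.
async function indexDocuments(
  env: SearchEnv,
  docs: { id: string; text: string }[],
): Promise<void> {
  const output = await env.AI.run(EMBEDDING_MODEL, { text: docs.map((d) => d.text) });
  await env.VECTORIZE.upsert(toVectorRecords(docs.map((d) => d.id), output));
}

// Embeds the question and returns the nearest stored vectors.
async function search(env: SearchEnv, question: string, topK = 3): Promise<unknown> {
  const output = await env.AI.run(EMBEDDING_MODEL, { text: [question] });
  return env.VECTORIZE.query(output.data[0], { topK });
}
```

Using the same model for indexing and querying matters here: vectors from different embedding models are not comparable, and the index dimensions must match the model's output size.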

Workers AI vs Cloudflare AI Gateway — what's the difference?

Workers AI is the inference platform (it runs models). AI Gateway is a proxy layer that adds caching, rate limiting, and logging in front of AI providers such as OpenAI and Anthropic, and it can also sit in front of Workers AI. They solve different problems: Workers AI for on-Cloudflare inference, AI Gateway for managing and observing calls to model APIs.
