Cerebras Not Working?

Getting an API 401 (csk- key, Bearer auth), a model-not-found error (use llama3.1-8b, not gpt-4o-style names), a 429 rate limit (30 RPM on the free tier, which is easy to exhaust at Cerebras speeds), an 8k context limit error, or a 503 outage? Here is how to fix each one fast.

Cerebras — live status

Updated every 5 minutes. Full history at prismix.dev/service/cerebras.

What's wrong? Diagnose fast

🔑

API 401 — Bearer auth with csk- key

Cerebras uses standard Bearer auth: Authorization: Bearer csk-YOUR_KEY. Keys start with "csk-" and are generated at cloud.cerebras.ai/platform/apikeys. With the OpenAI SDK: OpenAI(base_url="https://api.cerebras.ai/v1", api_key="csk-YOUR_KEY"). For the official SDK: pip install cerebras-cloud-sdk.
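As a quick pre-flight check for 401s, a small helper can validate the key prefix before any request is sent. This is a minimal sketch: the function name build_auth_headers is hypothetical, and the only facts it relies on are the csk- prefix and the Bearer header format described above.

```python
def build_auth_headers(api_key: str) -> dict:
    """Return request headers for the Cerebras API, rejecting obviously wrong keys.

    Hypothetical helper for illustration; not part of any SDK.
    """
    # Stray whitespace or a trailing newline (e.g. from a .env file) can break auth.
    key = api_key.strip()
    if not key.startswith("csk-"):
        raise ValueError(
            "Cerebras API keys start with 'csk-'; check which provider's key is loaded"
        )
    return {
        "Authorization": f"Bearer {key}",
        "Content-Type": "application/json",
    }
```

Failing early with a clear message is usually easier to debug than a bare 401 from the server, especially when several providers' keys live in the same environment.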

💬

Model 404 — wrong model IDs

Cerebras model IDs are NOT OpenAI or Anthropic names. Use: llama3.1-8b, llama3.1-70b, llama-4-scout-17b-16e-instruct, llama-4-maverick-17b-128e-instruct, qwen-3-32b. List all available via GET /v1/models. Do not use "gpt-4o", "claude-3-5-sonnet", or "llama-3" — those are not valid Cerebras model IDs.
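To catch a wrong model ID before it turns into a 404, one option is to compare it against a known list and suggest the nearest match. The sketch below assumes the IDs listed in this article are current; the live list from GET /v1/models is the authoritative source, and the name suggest_model_id is hypothetical.

```python
import difflib
from typing import Optional

# Model IDs as listed in this article (assumption: the live API list may differ).
KNOWN_MODEL_IDS = [
    "llama3.1-8b",
    "llama3.1-70b",
    "llama-4-scout-17b-16e-instruct",
    "llama-4-maverick-17b-128e-instruct",
    "qwen-3-32b",
]


def suggest_model_id(requested: str) -> Optional[str]:
    """Return the requested ID if known, else the closest known ID, else None."""
    if requested in KNOWN_MODEL_IDS:
        return requested
    # get_close_matches ranks candidates by string similarity (0.6 is its default cutoff).
    matches = difflib.get_close_matches(requested, KNOWN_MODEL_IDS, n=1, cutoff=0.6)
    return matches[0] if matches else None
```

A near miss such as "llama-3.1-8b" maps to "llama3.1-8b", while a name from another provider such as "gpt-4o" has no close match and returns None.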

429 rate limit — exhausted at 2000+ tok/s

Free tier: 30 RPM / 60,000 TPM. At 2,000 tok/s, 60,000 tokens take only about 30 seconds to generate, so a handful of long requests can exhaust the minute's budget well before the minute is over. Mitigation: exponential backoff on 429, reduce max_tokens, use streaming, upgrade to paid tier. Check the Retry-After header for when the limit resets.
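The backoff advice above can be sketched as a small retry wrapper. Everything named here is an assumption for illustration: RateLimited is a hypothetical exception your HTTP layer would raise on a 429, carrying the Retry-After value in seconds if the server sent one, and send is whatever callable performs the request.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical exception for a 429 response; retry_after is in seconds or None."""

    def __init__(self, retry_after=None):
        super().__init__("429 Too Many Requests")
        self.retry_after = retry_after


def call_with_backoff(send, max_attempts=5, base_delay=1.0, max_delay=30.0,
                      sleep=time.sleep):
    """Call send(), retrying on RateLimited with Retry-After or exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return send()
        except RateLimited as err:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            if err.retry_after is not None:
                # Prefer the server's own hint about when the limit resets.
                delay = float(err.retry_after)
            else:
                # Otherwise double the wait each attempt, capped, with a little jitter.
                delay = min(max_delay, base_delay * 2 ** attempt)
                delay += random.uniform(0, delay / 4)
            sleep(delay)
```

Passing sleep as a parameter keeps the wrapper easy to test without real waiting.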

📏

Context 400 — 8k token limit on Llama 3.1

Cerebras Llama 3.1 models max out at 8,192 tokens total (prompt + completion combined). This is much shorter than GPT-4o (128k) or Claude 3.5 (200k). For long conversations: summarize history before sending, or switch to llama-4-maverick-17b-128e-instruct (16k context).
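A simpler alternative to summarizing is to drop the oldest turns until the conversation fits. The sketch below does that, not summarization, and it rests on two loud assumptions: token counts are estimated at roughly 4 characters per token (not the model's real tokenizer), and 1,024 tokens are reserved for the completion. Both function names are hypothetical.

```python
def estimate_tokens(text):
    """Rough heuristic (assumption): about 4 characters per token."""
    return max(1, len(text) // 4)


def trim_history(messages, context_limit=8192, reserve_for_output=1024):
    """Keep system messages plus the newest turns that fit in the prompt budget."""
    budget = context_limit - reserve_for_output
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    used = sum(estimate_tokens(m["content"]) for m in system)

    kept = []
    # Walk from newest to oldest so the most recent context survives.
    for message in reversed(rest):
        cost = estimate_tokens(message["content"])
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    return system + kept[::-1]  # restore chronological order
```

Because the estimate is approximate, leaving some headroom below the real limit is safer than filling the budget exactly.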

🛑

503 service unavailable

Cerebras is a single-region provider — 503 means peak load or maintenance. Check prismix.dev/service/cerebras for live status. Retry with exponential backoff: start at 1s, cap at 30s. 503 usually resolves within minutes unless there is a major outage.
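The "start at 1s, cap at 30s" schedule can be written out explicitly. In this sketch, send is a hypothetical callable that returns a (status_code, body) pair; the number of retries is an arbitrary choice.

```python
import time


def backoff_delays(retries=6, start=1.0, cap=30.0):
    """Delays that double from `start` and never exceed `cap`."""
    return [min(cap, start * 2 ** i) for i in range(retries)]


def retry_on_503(send, retries=6, sleep=time.sleep):
    """Call send() until it stops returning 503 or the retries run out."""
    status, body = send()
    for delay in backoff_delays(retries):
        if status != 503:
            break
        sleep(delay)
        status, body = send()
    return status, body
```

With the defaults, the waits are 1, 2, 4, 8, 16, and 30 seconds, about a minute in total before giving up.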

✂️

Streaming cut off early

If streaming responses are cut off: (1) set a higher max_tokens — Cerebras defaults to max_tokens=512 if not specified, which cuts off long responses; (2) check your HTTP client timeout — at 2000+ tok/s even long responses finish fast, but set timeout to at least 30s; (3) ensure your streaming event loop is consuming the entire SSE stream before closing.
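To tell a max_tokens cutoff from a dropped connection, it can help to drain the whole stream and record the final finish_reason. This sketch assumes OpenAI-style streaming chunks (choices[0].delta.content and choices[0].finish_reason), which is what the OpenAI SDK yields when called with stream=True; whether Cerebras reports "length" for a max_tokens cutoff in exactly this way is an assumption based on its OpenAI compatibility.

```python
def collect_stream(stream):
    """Drain an OpenAI-style chat completion stream; return (text, finish_reason)."""
    parts = []
    finish_reason = None
    for chunk in stream:
        if not chunk.choices:
            continue  # skip chunks that carry no choices
        choice = chunk.choices[0]
        if choice.delta.content:
            parts.append(choice.delta.content)
        if choice.finish_reason is not None:
            finish_reason = choice.finish_reason
    return "".join(parts), finish_reason


# Usage with the client from the quick reference (not executed here):
#   stream = client.chat.completions.create(model="llama3.1-8b",
#                                           messages=msgs, max_tokens=2048, stream=True)
#   text, reason = collect_stream(stream)
```

Under these assumptions, a reason of "length" points at max_tokens, "stop" is a normal ending, and None suggests the stream was closed before the final chunk arrived.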

Cerebras API quick reference

OpenAI SDK (drop-in, recommended)

from openai import OpenAI

client = OpenAI(
    base_url="https://api.cerebras.ai/v1",
    api_key="csk-YOUR_KEY",  # key starts with csk-
)

response = client.chat.completions.create(
    model="llama3.1-70b",   # NOT "gpt-4o" or similar
    messages=[{"role": "user", "content": "What is 2+2?"}],
    max_tokens=1024,         # defaults to 512 if not set
)
print(response.choices[0].message.content)

Official Cerebras Python SDK

pip install cerebras-cloud-sdk

from cerebras.cloud.sdk import Cerebras

client = Cerebras(api_key="csk-YOUR_KEY")

response = client.chat.completions.create(
    model="llama3.1-70b",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=1024,
)
print(response.choices[0].message.content)

curl with exponential backoff on 429

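# --retry makes curl retry transient errors (including 429 and 503), doubling the wait each time
curl https://api.cerebras.ai/v1/chat/completions \
  --retry 5 --retry-max-time 60 \
  -H "Authorization: Bearer csk-YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1-8b",
    "messages": [{"role": "user", "content": "Hi"}],
    "max_tokens": 512
  }'
# On 429: curl 7.66+ honors the Retry-After header when --retry is set; otherwise check it, sleep, then retry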

Cerebras models quick reference

Model ID                           | Context       | Speed
llama3.1-8b                        | 8,192 tokens  | Fastest (2,000+ tok/s)
llama3.1-70b                       | 8,192 tokens  | Fast (1,400+ tok/s)
llama-4-scout-17b-16e-instruct     | 10,000 tokens | ~1,000 tok/s
llama-4-maverick-17b-128e-instruct | 16,000 tokens | ~800 tok/s, best quality
qwen-3-32b                         | 16,000 tokens | ~700 tok/s

Get current list: GET https://api.cerebras.ai/v1/models (auth required)
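The same lookup can be done from Python with only the standard library. This is a sketch under one assumption: the response follows the OpenAI-style list shape, {"data": [{"id": ...}, ...]}. The function names are hypothetical.

```python
import json
import urllib.request


def build_models_request(api_key):
    """Build the authenticated GET request for the models endpoint."""
    return urllib.request.Request(
        "https://api.cerebras.ai/v1/models",
        headers={"Authorization": f"Bearer {api_key}"},
        method="GET",
    )


def extract_model_ids(payload):
    """Pull model IDs out of an OpenAI-style list response (assumed shape)."""
    return sorted(item["id"] for item in payload.get("data", []))


def list_model_ids(api_key, timeout=30.0):
    """Fetch and return the model IDs currently available to this key."""
    with urllib.request.urlopen(build_models_request(api_key), timeout=timeout) as resp:
        return extract_model_ids(json.load(resp))
```

Calling list_model_ids("csk-YOUR_KEY") at startup and comparing the result with the model you intend to use surfaces a stale ID before the first chat request fails.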

Step-by-step fix

  1. Check live Cerebras status

    Visit prismix.dev/service/cerebras. Cerebras is single-region — outages affect all users simultaneously.

  2. Fix API 401

    Key format: csk-YOUR_KEY. Auth: Authorization: Bearer csk-YOUR_KEY. Generate at cloud.cerebras.ai/platform/apikeys. OpenAI SDK: OpenAI(base_url="https://api.cerebras.ai/v1", api_key="csk-YOUR_KEY").

  3. Fix model not found

    Use Cerebras model IDs: llama3.1-8b, llama3.1-70b, llama-4-maverick-17b-128e-instruct. NOT "gpt-4o" or "claude-3-5-sonnet". List available: GET /v1/models.

  4. Fix rate limit 429

    Free tier: 30 RPM / 60k TPM. At 2000+ tok/s you can burn through this fast. Add exponential backoff, check Retry-After header. Set max_tokens explicitly to control generation length.

  5. Fix context length error

    Llama 3.1 models: 8,192 token max (prompt + output). Summarize conversation history before sending, or switch to llama-4-maverick-17b-128e-instruct (16k context). Cerebras context limits are much smaller than OpenAI's, so plan accordingly.

Frequently asked questions

Why is Cerebras not working?

Cerebras issues: (1) 401 — key starts with csk-, header: Authorization: Bearer csk-KEY; (2) model 404 — use llama3.1-8b or llama3.1-70b (not OpenAI names); (3) 429 — 30 RPM free tier, add backoff; (4) context 400 — max 8k tokens for Llama 3.1; (5) 503 — check prismix.dev/service/cerebras for outage.

Is Cerebras down right now?

Check prismix.dev/service/cerebras for live status.

What models does Cerebras support?

Cerebras models (exact IDs): llama3.1-8b (8k ctx, fastest), llama3.1-70b (8k ctx, best Llama 3.1 quality), llama-4-scout-17b-16e-instruct (10k ctx), llama-4-maverick-17b-128e-instruct (16k ctx, best quality), qwen-3-32b (16k ctx). List with GET https://api.cerebras.ai/v1/models.

Can I use OpenAI SDK with Cerebras?

Yes — Cerebras is OpenAI API-compatible. Set base_url="https://api.cerebras.ai/v1" and use your csk- key as api_key. The chat completions endpoint works identically. You can also use the official Cerebras SDK: pip install cerebras-cloud-sdk.

Why does Cerebras rate limit so quickly at 2000+ tokens/sec?

Cerebras free tier caps at 60,000 tokens per minute (TPM). At 2,000 tokens/sec, 60,000 tokens take only 30 seconds to generate, so a few long generations can exhaust the entire minute's budget in half the time. Fix: set lower max_tokens, add Retry-After-based backoff, or upgrade to a paid plan. Paid tiers offer much higher TPM limits.
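The arithmetic behind this answer is easy to check. The sketch below uses the free-tier numbers quoted in this article (30 RPM, 60,000 TPM) as defaults and treats each request as consuming a fixed number of tokens, which is a simplification; the function names are hypothetical.

```python
def seconds_to_exhaust(tpm_limit=60_000, tokens_per_second=2_000):
    """Seconds of sustained generation needed to use up one minute's token budget."""
    return tpm_limit / tokens_per_second


def max_requests_per_minute(tokens_per_request, rpm_limit=30, tpm_limit=60_000):
    """Requests per minute that fit under both the RPM and the TPM limit."""
    return min(rpm_limit, tpm_limit // tokens_per_request)


print(seconds_to_exhaust())           # 30.0
print(max_requests_per_minute(8192))  # 7  (TPM is the binding limit)
print(max_requests_per_minute(1000))  # 30 (RPM is the binding limit)
```

With requests near the 8,192-token ceiling, the token budget runs out after about 7 requests, long before the 30 RPM cap matters; keeping requests under 2,000 tokens each makes RPM the limiting factor instead.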