Cerebras API 401 — how to fix?

Cerebras uses standard Bearer authentication. Send the header Authorization: Bearer csk-YOUR_KEY. Generate keys at cloud.cerebras.ai/platform/apikeys and use the base URL https://api.cerebras.ai/v1. With the OpenAI Python SDK: from openai import OpenAI; client = OpenAI(base_url='https://api.cerebras.ai/v1', api_key='csk-YOUR_KEY'). There is also an official Cerebras Python SDK: pip install cerebras-cloud-sdk.

Cerebras rate limit 429 — how to fix?

Cerebras free tier: 30 RPM / 60,000 TPM. Because Cerebras generates 2,000+ tokens/sec, a handful of long generations can exhaust the token budget well within a minute. Fix: (1) add exponential backoff on 429 (the Retry-After header shows when the limit resets); (2) reduce max_tokens per request; (3) use streaming so you can react to the output sooner; (4) upgrade your plan for higher limits. Paid tiers offer significantly higher RPM.

Cerebras Not Working?

API 401 (csk- key, Bearer auth), model not found (use llama3.1-8b not gpt-4o names), rate limit 429 (30 RPM — Cerebras is so fast you burn it instantly), 8k context limit, or 503 outage? Fix it fast.

Cerebras — live status

Updated every 5 minutes. Full history at prismix.dev/service/cerebras.

What's wrong? Diagnose fast

🔑

API 401 — Bearer auth with csk- key

Standard Bearer auth: Authorization: Bearer csk-YOUR_KEY. Key starts with "csk-", generated at cloud.cerebras.ai/platform/apikeys. OpenAI SDK: OpenAI(base_url="https://api.cerebras.ai/v1", api_key="csk-YOUR_KEY"). Official SDK: pip install cerebras-cloud-sdk.
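A minimal sketch of the header construction described above, using only the standard library. The function name auth_headers is hypothetical, and stripping whitespace is an assumption about a likely failure mode (keys pasted from env files), not something this page documents.

```python
def auth_headers(api_key: str) -> dict:
    """Build Bearer auth headers, rejecting keys that cannot be Cerebras keys."""
    # Assumption: stray whitespace or newlines around the key are worth removing.
    key = api_key.strip()
    if not key.startswith("csk-"):
        raise ValueError("Cerebras API keys start with 'csk-'")
    return {
        "Authorization": f"Bearer {key}",
        "Content-Type": "application/json",
    }
```

Checking the prefix locally catches the case where a key for another provider was loaded by mistake, before any request is sent.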

💬

Model 404 — wrong model IDs

Cerebras model IDs are NOT OpenAI or Anthropic names. Use: llama3.1-8b, llama3.1-70b, llama-4-scout-17b-16e-instruct, llama-4-maverick-17b-128e-instruct, qwen-3-32b. List all available via GET /v1/models. Do not use "gpt-4o", "claude-3-5-sonnet", or "llama-3" — those are not valid Cerebras model IDs.
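As a rough illustration, a model name can be checked locally before sending a request. The list below is only a snapshot of the IDs named on this page, and check_model_id is a hypothetical helper; the live list from GET /v1/models may differ.

```python
import difflib

# Snapshot of the IDs listed on this page; not guaranteed to be current.
KNOWN_MODEL_IDS = sorted([
    "llama3.1-8b",
    "llama3.1-70b",
    "llama-4-scout-17b-16e-instruct",
    "llama-4-maverick-17b-128e-instruct",
    "qwen-3-32b",
])


def check_model_id(model: str) -> str:
    """Return the model ID if it is in the snapshot, else raise with suggestions."""
    if model in KNOWN_MODEL_IDS:
        return model
    close = difflib.get_close_matches(model, KNOWN_MODEL_IDS, n=2, cutoff=0.5)
    hint = f" Did you mean: {', '.join(close)}?" if close else ""
    raise ValueError(f"{model!r} is not a listed Cerebras model ID.{hint}")
```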

⚡

429 rate limit — exhausted at 2000+ tok/s

Free tier: 30 RPM / 60,000 TPM. At 2,000 tok/s, 60,000 tokens is only 30 seconds of generation, so a few long requests can use up the whole minute's token budget. Mitigation: exponential backoff on 429, reduce max_tokens, use streaming, upgrade to a paid tier. Check the Retry-After header for when the limit resets.
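The arithmetic behind these limits can be sketched as follows. This assumes the TPM budget counts prompt and completion tokens together and that every request uses its full max_tokens, which is a worst-case simplification; the function names are hypothetical.

```python
FREE_TIER_RPM = 30       # requests per minute, from the text above
FREE_TIER_TPM = 60_000   # tokens per minute, from the text above


def sustainable_requests_per_minute(prompt_tokens: int, max_tokens: int) -> int:
    """Worst-case requests per minute that fit inside both free-tier limits."""
    per_request = prompt_tokens + max_tokens
    return min(FREE_TIER_RPM, FREE_TIER_TPM // per_request)


def seconds_to_exhaust_tpm(tokens_per_second: float) -> float:
    """Seconds of continuous generation needed to consume the whole TPM budget."""
    return FREE_TIER_TPM / tokens_per_second
```

For example, with a 4,096-token prompt and max_tokens=4096, only 7 such requests fit in a minute, far fewer than the 30 RPM limit suggests.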

📏

Context 400 — 8k token limit on Llama 3.1

Cerebras Llama 3.1 models max out at 8,192 tokens total (prompt + completion combined). This is much shorter than GPT-4o (128k) or Claude 3.5 (200k). For long conversations: summarize history before sending, or switch to llama-4-maverick-17b-128e-instruct (16,000-token context).
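One simple way to stay under the limit is sketched below. Note that it drops the oldest turns rather than summarizing them, which is a cheaper substitute for the summarization suggested above. The 4-characters-per-token estimate is a rough assumption, not the model's real tokenizer, and both function names are hypothetical.

```python
def estimate_tokens(text: str) -> int:
    """Very rough token estimate (assumption: about 4 characters per token)."""
    return max(1, len(text) // 4)


def trim_history(messages, context_limit=8192, max_tokens=1024):
    """Keep system messages plus the newest turns that fit the prompt budget."""
    budget = context_limit - max_tokens  # tokens left for the prompt
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    used = sum(estimate_tokens(m["content"]) for m in system)
    kept = []
    for msg in reversed(rest):  # walk from newest to oldest
        cost = estimate_tokens(msg["content"])
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return system + kept[::-1]  # restore chronological order
```

Because the estimate is approximate, leaving some headroom below the real limit is safer than targeting 8,192 exactly.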

🛑

503 service unavailable

Cerebras is a single-region provider — 503 means peak load or maintenance. Check prismix.dev/service/cerebras for live status. Retry with exponential backoff: start at 1s, cap at 30s. 503 usually resolves within minutes unless there is a major outage.
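The retry schedule described above (start at 1s, cap at 30s) can be sketched as a list of delays. Doubling on each attempt is the usual reading of "exponential backoff"; the function name and the number of attempts are hypothetical.

```python
def backoff_delays(attempts: int, base: float = 1.0, cap: float = 30.0) -> list:
    """Delays in seconds for each retry: base, 2*base, 4*base, ... capped at cap."""
    return [min(cap, base * 2 ** i) for i in range(attempts)]
```

With seven attempts this gives waits of 1, 2, 4, 8, 16, 30 and 30 seconds, a little over a minute and a half in total.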

✂️

Streaming cut off early

If streaming responses are cut off: (1) set a higher max_tokens — Cerebras defaults to max_tokens=512 if not specified, which cuts off long responses; (2) check your HTTP client timeout — at 2000+ tok/s even long responses finish fast, but set timeout to at least 30s; (3) ensure your streaming event loop is consuming the entire SSE stream before closing.
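To illustrate point (3), the sketch below consumes a stream to the end and reports why it stopped. It assumes Cerebras emits OpenAI-style SSE chunks ("data: {json}" lines ending with "data: [DONE]", text under choices[0].delta.content), which follows from the OpenAI compatibility described on this page but is not confirmed here. collect_stream is a hypothetical helper.

```python
import json


def collect_stream(lines):
    """Join streamed text from SSE lines and return (text, last finish_reason)."""
    parts, finish_reason = [], None
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel in the assumed format
        choices = json.loads(payload).get("choices") or []
        if not choices:
            continue
        choice = choices[0]
        parts.append((choice.get("delta") or {}).get("content") or "")
        finish_reason = choice.get("finish_reason") or finish_reason
    return "".join(parts), finish_reason
```

In the OpenAI format, a finish_reason of "length" indicates the response hit max_tokens, which would point to cause (1) rather than a client timeout.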

Cerebras API quick reference

OpenAI SDK (drop-in, recommended)

from openai import OpenAI

client = OpenAI(
    base_url="https://api.cerebras.ai/v1",
    api_key="csk-YOUR_KEY",  # key starts with csk-
)

response = client.chat.completions.create(
    model="llama3.1-70b",   # NOT "gpt-4o" or similar
    messages=[{"role": "user", "content": "What is 2+2?"}],
    max_tokens=1024,         # defaults to 512 if not set
)
print(response.choices[0].message.content)

Official Cerebras Python SDK

pip install cerebras-cloud-sdk

from cerebras.cloud.sdk import Cerebras

client = Cerebras(api_key="csk-YOUR_KEY")

response = client.chat.completions.create(
    model="llama3.1-70b",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=1024,
)
print(response.choices[0].message.content)

curl with exponential backoff on 429

curl -i https://api.cerebras.ai/v1/chat/completions \
  -H "Authorization: Bearer csk-YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1-8b",
    "messages": [{"role": "user", "content": "Hi"}],
    "max_tokens": 512
  }'
# -i prints the response headers, so a 429 shows its Retry-After value.
# On 429: sleep for Retry-After seconds, then retry, doubling the wait on repeated 429s.
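The retry logic in that comment might look like the following in Python. This is a sketch using the standard library's urllib error type; retry_delay and call_with_retry are hypothetical names, send stands for any zero-argument function that performs the request, and Retry-After is assumed to carry a number of seconds.

```python
import time
import urllib.error


def retry_delay(attempt, retry_after=None, base=1.0, cap=30.0):
    """Prefer the server's Retry-After seconds; otherwise back off exponentially."""
    if retry_after is not None:
        try:
            return max(0.0, float(retry_after))
        except ValueError:
            pass  # Retry-After can also be an HTTP date; fall back to backoff
    return min(cap, base * 2 ** attempt)


def call_with_retry(send, max_attempts=5, sleep=time.sleep):
    """Call send(); on HTTP 429, wait and retry up to max_attempts times."""
    for attempt in range(max_attempts):
        try:
            return send()
        except urllib.error.HTTPError as err:
            if err.code != 429 or attempt == max_attempts - 1:
                raise  # other errors, or out of attempts
            sleep(retry_delay(attempt, err.headers.get("Retry-After")))
```

Passing sleep as a parameter keeps the wrapper easy to test without real waiting.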

Cerebras models quick reference

Model ID	Context	Speed
llama3.1-8b	8,192 tokens	Fastest — 2,000+ tok/s
llama3.1-70b	8,192 tokens	Fast — 1,400+ tok/s
llama-4-scout-17b-16e-instruct	10,000 tokens	~1,000 tok/s
llama-4-maverick-17b-128e-instruct	16,000 tokens	~800 tok/s, best quality
qwen-3-32b	16,000 tokens	~700 tok/s

Get current list: GET https://api.cerebras.ai/v1/models (auth required)
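A standard-library sketch of that request is shown below. The response shape (a top-level "data" array of objects with an "id" field) is assumed from the OpenAI-compatible format and is not confirmed on this page; all three function names are hypothetical.

```python
import json
import urllib.request

MODELS_URL = "https://api.cerebras.ai/v1/models"  # from the line above


def build_models_request(api_key: str) -> urllib.request.Request:
    """Create the authenticated GET request for the model list."""
    return urllib.request.Request(
        MODELS_URL, headers={"Authorization": f"Bearer {api_key}"}
    )


def parse_model_ids(body) -> list:
    """Extract model IDs, assuming an OpenAI-style {"data": [{"id": ...}]} body."""
    return [entry["id"] for entry in json.loads(body)["data"]]


def list_model_ids(api_key: str) -> list:
    """Fetch and parse the live model list (performs a network call)."""
    with urllib.request.urlopen(build_models_request(api_key), timeout=30) as resp:
        return parse_model_ids(resp.read())
```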

Step-by-step fix

1. Check live Cerebras status

Visit prismix.dev/service/cerebras. Cerebras is single-region, so outages affect all users simultaneously.

2. Fix API 401

Key format: csk-YOUR_KEY. Auth: Authorization: Bearer csk-YOUR_KEY. Generate at cloud.cerebras.ai/platform/apikeys. OpenAI SDK: OpenAI(base_url="https://api.cerebras.ai/v1", api_key="csk-YOUR_KEY").

3. Fix model not found

Use Cerebras model IDs: llama3.1-8b, llama3.1-70b, llama-4-maverick-17b-128e-instruct. NOT "gpt-4o" or "claude-3-5-sonnet". List available: GET /v1/models.

4. Fix rate limit 429

Free tier: 30 RPM / 60k TPM. At 2000+ tok/s you can burn through this fast. Add exponential backoff and check the Retry-After header. Set max_tokens explicitly to control generation length.

5. Fix context length error

Llama 3.1 models: 8,192 token max (prompt + output). Summarize conversation history before sending, or switch to llama-4-maverick-17b-128e-instruct (16,000-token context). Cerebras context limits are much smaller than OpenAI's, so plan accordingly.
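The steps above can be condensed into a small lookup from HTTP status to the first thing to check. The status codes are the ones this page associates with each problem; suggest_fix and the wording of the hints are hypothetical.

```python
# Status codes as used on this page, mapped to the first thing to check.
FIXES = {
    401: "Check the key: it must start with 'csk-' and be sent as a Bearer token.",
    404: "Check the model ID against GET /v1/models; OpenAI names are not valid.",
    429: "Back off, honor Retry-After, and lower max_tokens.",
    400: "Check total tokens against the model's context limit.",
    503: "Check the status page, then retry with backoff (1s start, 30s cap).",
}


def suggest_fix(status: int) -> str:
    """Return a short hint for a failed Cerebras API call."""
    return FIXES.get(status, "Unrecognized status; inspect the response body.")
```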
