Groq API Not Working?

Groq API returning 401 (gsk_ key format), 429 rate limit, model not found, context length exceeded, or streaming failing? Check live status and fix it fast.

Groq Cloud — live status

Updated every 5 minutes. Full history at prismix.dev/service/groq.

What's wrong? Diagnose fast

🔑

API 401 — invalid API key

Keys start with gsk_. Header: Authorization: Bearer gsk_YOUR_KEY. Generate at console.groq.com/keys. Using OpenAI SDK? Set base_url="https://api.groq.com/openai/v1" and api_key="gsk_YOUR_KEY".
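A malformed key can be caught before any request is sent. The sketch below is one way to do that; the helper name groq_auth_headers is hypothetical, and the GROQ_API_KEY environment variable in the usage comment is an assumed convention for where the key is stored.

```python
import os


def groq_auth_headers(api_key: str) -> dict:
    """Build request headers, failing early if the key is obviously malformed."""
    # Stray whitespace or a trailing newline from copy-paste can break the header.
    key = api_key.strip()
    if not key.startswith("gsk_"):
        raise ValueError("Groq API keys start with 'gsk_'")
    return {
        "Authorization": f"Bearer {key}",
        "Content-Type": "application/json",
    }


# Usage (assumes the key is stored in an environment variable):
# headers = groq_auth_headers(os.environ["GROQ_API_KEY"])
```

Reading the key from the environment rather than hard-coding it also keeps it out of version control.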

🚫

429 — rate limit exceeded

Free tier: LLaMA 3.3 70B = 6k TPM, LLaMA 3.1 8B = 30k TPM, Mixtral = 5k TPM. Fix: switch to LLaMA 3.1 8B for lighter tasks (5× the token limit of LLaMA 3.3 70B). Add billing info at console.groq.com to increase limits. Retry 429 responses with exponential backoff.
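Exponential backoff can be sketched as below. The RateLimited exception and the call_with_backoff helper are hypothetical stand-ins: with the OpenAI SDK you would catch openai.RateLimitError instead, and the delay values are illustrative assumptions, not Groq recommendations.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical stand-in for your client's HTTP 429 exception."""


def call_with_backoff(send, max_retries=5, base_delay=1.0, max_delay=30.0,
                      sleep=time.sleep):
    """Call send() and retry on RateLimited, doubling the wait each time."""
    for attempt in range(max_retries + 1):
        try:
            return send()
        except RateLimited:
            if attempt == max_retries:
                raise  # out of retries: surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Up to 10% jitter so parallel clients do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
```

The sleep parameter exists only so the waiting can be swapped out in tests; in normal use leave it at the default.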

🔍

404 — model not found

Use exact model ID strings: llama-3.3-70b-versatile, llama-3.1-8b-instant, mixtral-8x7b-32768, gemma2-9b-it. Model IDs are versioned and change over time, so an ID that worked last month may no longer exist. Get the current list from console.groq.com/docs/models.

📏

Context length exceeded

LLaMA 3.1 8B and LLaMA 3.3 70B = 128K tokens. Mixtral 8x7B = 32K tokens. Gemma2 9B = 8K tokens. Trim your messages or implement a sliding-window context. Count tokens before sending with the model's own tokenizer, or with tiktoken as an approximation.
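A cheap pre-flight check can catch oversized requests before they are sent. The sketch below does not use a real tokenizer: the 4-characters-per-token ratio and the 4-token per-message overhead are rough assumptions, and the exact limits assume "K" means 1,024 tokens. Swap in a real tokenizer when accuracy matters.

```python
# Context limits from the figures above, assuming K = 1,024 tokens.
CONTEXT_LIMITS = {
    "llama-3.3-70b-versatile": 131_072,
    "llama-3.1-8b-instant": 131_072,
    "mixtral-8x7b-32768": 32_768,
    "gemma2-9b-it": 8_192,
}


def estimate_tokens(messages, chars_per_token=4):
    """Rough estimate only: ~4 characters per token plus 4 tokens per message."""
    total_chars = sum(len(m["content"]) for m in messages)
    return total_chars // chars_per_token + 4 * len(messages)


def fits_context(model, messages, max_tokens=256):
    """True if the estimated prompt plus the reply budget fits the model."""
    return estimate_tokens(messages) + max_tokens <= CONTEXT_LIMITS[model]
```

Because the estimate is approximate, leave a safety margin rather than filling the window to the last token.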

📡

Streaming not working

Python: set stream=True. JS/TS: stream: true. Read the full SSE stream until the [DONE] marker. If stream stops early: check network timeout (set 60s+). Some proxies buffer SSE — switch to a direct connection or disable buffering.
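If you read the stream yourself instead of through an SDK, the parsing loop looks roughly like this. It is a sketch that assumes OpenAI-style chunks, with text under choices[0].delta.content; iter_stream_text is a hypothetical helper name, and lines can be any iterable of decoded SSE lines.

```python
import json


def iter_stream_text(lines):
    """Yield text deltas from OpenAI-style SSE lines until the [DONE] marker."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return  # end of stream: stop reading
        chunk = json.loads(payload)
        choices = chunk.get("choices") or []
        if not choices:
            continue
        text = (choices[0].get("delta") or {}).get("content")
        if text:
            yield text
```

If the loop ends without ever seeing [DONE], the connection was probably cut by a timeout or a buffering proxy.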

⚡

Slow response / not fast

Groq's LPU hardware is at its fastest on short, single-turn completions. Latency grows with context length and with long multi-turn conversations. Cold start: the first request after an idle period may take 1-2 s. Bursts of parallel requests hit rate limits sooner, so prefer sequential requests with backoff.

Groq API quick reference

curl (OpenAI-compatible)

curl -X POST "https://api.groq.com/openai/v1/chat/completions" \
  -H "Authorization: Bearer gsk_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.3-70b-versatile",
    "messages": [
      {"role": "user", "content": "Explain quantum computing in one sentence"}
    ],
    "temperature": 0.7,
    "max_tokens": 256
  }'

Python (Groq SDK)

from groq import Groq

client = Groq(api_key="gsk_YOUR_KEY")

chat = client.chat.completions.create(
    model="llama-3.1-8b-instant",
    messages=[{"role": "user", "content": "Hello, fast world!"}],
)
print(chat.choices[0].message.content)

Python (OpenAI SDK — drop-in replacement)

from openai import OpenAI

client = OpenAI(
    api_key="gsk_YOUR_KEY",
    base_url="https://api.groq.com/openai/v1",
)

chat = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(chat.choices[0].message.content)

Groq free tier rate limits by model

Model	Model ID	TPM	RPM	Context
LLaMA 3.3 70B	llama-3.3-70b-versatile	6,000	30	128K
LLaMA 3.1 8B	llama-3.1-8b-instant	30,000	30	128K
LLaMA 3.1 70B	llama-3.1-70b-versatile	6,000	30	128K
Mixtral 8x7B	mixtral-8x7b-32768	5,000	30	32K
Gemma2 9B	gemma2-9b-it	15,000	30	8K

TPM = tokens per minute. RPM = requests per minute. Add billing at console.groq.com to unlock pay-as-you-go higher limits. Check console.groq.com/docs/models for the latest list.
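To see which limit bites first, divide the TPM budget by your typical request size and compare the result with RPM. The sketch below copies the free-tier numbers from the table; it assumes prompt and completion tokens both count toward TPM, and max_requests_per_minute is a hypothetical helper.

```python
# (TPM, RPM) per model ID, copied from the table above.
FREE_TIER_LIMITS = {
    "llama-3.3-70b-versatile": (6_000, 30),
    "llama-3.1-8b-instant": (30_000, 30),
    "llama-3.1-70b-versatile": (6_000, 30),
    "mixtral-8x7b-32768": (5_000, 30),
    "gemma2-9b-it": (15_000, 30),
}


def max_requests_per_minute(model, tokens_per_request):
    """Sustainable requests per minute under both the TPM and RPM limits."""
    tpm, rpm = FREE_TIER_LIMITS[model]
    return min(rpm, tpm // tokens_per_request)


print(max_requests_per_minute("llama-3.3-70b-versatile", 1_000))  # 6
print(max_requests_per_minute("llama-3.1-8b-instant", 1_000))     # 30
```

At about 1,000 tokens per request, the 70B model is capped by TPM well before it reaches the 30 RPM limit, while the 8B model can use the full 30 requests.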

Step-by-step fix

1

Check live Groq status

Visit prismix.dev/service/groq and groqstatus.com. If the API component is operational, the problem is most likely on your side: authentication, rate limits, the model ID, or context length.

2

Fix API 401 authentication

Your API key must start with gsk_. Generate at console.groq.com/keys. Use header: Authorization: Bearer gsk_YOUR_KEY. If using the OpenAI SDK, set base_url="https://api.groq.com/openai/v1" in the client constructor.

3

Fix 429 rate limit errors

Free tier LLaMA 3.3 70B: 6,000 tokens/min. For lighter tasks: switch to llama-3.1-8b-instant (30,000 TPM — 5× higher limit). Implement exponential backoff on 429 responses. To increase limits permanently: add a payment method at console.groq.com/settings/billing.

4

Fix model not found (404)

Use exact model IDs: llama-3.3-70b-versatile, llama-3.1-8b-instant, mixtral-8x7b-32768. Short names like "llama3" or "mistral" return 404. Get the current full list: GET https://api.groq.com/openai/v1/models.
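Once you have the JSON from the models endpoint, validating a model ID before use turns a 404 into a clearer local error. This sketch assumes the response uses the OpenAI-style list shape, a "data" array of objects with an "id" field; the helper names and the sample payload are hypothetical.

```python
import difflib


def extract_model_ids(payload):
    """Pull model IDs out of an OpenAI-style {"data": [{"id": ...}]} listing."""
    return sorted(item["id"] for item in payload.get("data", []))


def check_model(model_id, available_ids):
    """Return model_id if it is listed, else raise with near-miss suggestions."""
    if model_id in available_ids:
        return model_id
    hints = difflib.get_close_matches(model_id, available_ids, n=3, cutoff=0.3)
    raise ValueError(f"Unknown model {model_id!r}. Close matches: {hints or 'none'}")


# Hypothetical sample of a parsed response from the models endpoint.
sample = {
    "object": "list",
    "data": [{"id": "llama-3.3-70b-versatile"}, {"id": "llama-3.1-8b-instant"}],
}
ids = extract_model_ids(sample)
```

Calling check_model("llama3", ids) raises ValueError, while an exact ID is returned unchanged.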

5

Fix context length errors

Trim your messages to fit: LLaMA 3.1 and 3.3 = 128K tokens max, Mixtral = 32K tokens, Gemma2 = 8K tokens. Count tokens before sending. For long documents: use a sliding-window approach or summarize older turns. When truncating, drop the oldest turns first and keep the system prompt plus the most recent messages; this preserves coherence.
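The truncation strategy can be sketched as follows. trim_messages is a hypothetical helper, and count_tokens is whatever per-message counting function you supply (a real tokenizer or an estimate); the sketch assumes the system prompt alone fits within the budget.

```python
def trim_messages(messages, max_tokens, count_tokens):
    """Keep system messages plus the most recent turns that fit in max_tokens."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    budget = max_tokens - sum(count_tokens(m) for m in system)

    kept = []
    for msg in reversed(rest):  # walk from newest to oldest
        cost = count_tokens(msg)
        if cost > budget:
            break  # this turn and everything older is dropped
        kept.append(msg)
        budget -= cost
    return system + kept[::-1]  # restore chronological order
```

Set max_tokens below the model's context limit so the reply has room (the limit minus the request's max_tokens value is a reasonable starting point).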
