Groq API Not Working?
Groq API returning 401 (gsk_ key format), 429 rate limit, model not found, context length exceeded, or streaming failing? Check live status and fix it fast.
Groq Cloud — live status
Updated every 5 minutes. Full history at prismix.dev/service/groq.
What's wrong? Diagnose fast
API 401 — invalid API key
Keys start with gsk_. Header: Authorization: Bearer gsk_YOUR_KEY. Generate at console.groq.com/keys. Using OpenAI SDK? Set base_url="https://api.groq.com/openai/v1" and api_key="gsk_YOUR_KEY".
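A minimal pre-flight check can catch the most common 401 causes (missing key, stray whitespace, a key from another provider) before any request is sent. This is a sketch under assumptions: the helper name `build_auth_headers` and the environment variable `GROQ_API_KEY` are illustrative choices, and only the `gsk_` prefix and the Bearer header come from the text above.

```python
import os


def build_auth_headers(api_key=None):
    """Return request headers for Groq, or raise ValueError for an obviously bad key.

    Hypothetical helper: GROQ_API_KEY is an assumed variable name.
    """
    key = api_key if api_key is not None else os.environ.get("GROQ_API_KEY", "")
    key = key.strip()  # copied keys often carry a trailing newline or space
    if not key:
        raise ValueError("No API key found; generate one at console.groq.com/keys")
    if not key.startswith("gsk_"):
        raise ValueError("Groq keys start with 'gsk_'; this looks like a different key")
    return {
        "Authorization": f"Bearer {key}",
        "Content-Type": "application/json",
    }
```

A well-formed key can still be revoked or mistyped, so a 401 after this check passes usually means the key should be regenerated.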
429 — rate limit exceeded
Free tier: LLaMA 3.3 70B = 6k TPM, LLaMA 3.1 8B = 30k TPM, Mixtral = 5k TPM. Fix: move lighter tasks that do not need the 70B model's quality to LLaMA 3.1 8B (5× higher token limit). Add billing info at console.groq.com to increase limits. Retry with exponential backoff on 429 responses.
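Exponential backoff can be sketched as a small wrapper around any request function. The `RateLimited` exception is a hypothetical stand-in for whatever your client raises on HTTP 429, and the delay values are assumed defaults, not Groq recommendations.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical stand-in for whatever your client raises on HTTP 429."""


def call_with_backoff(fn, max_retries=5, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call fn(); on RateLimited, wait base_delay * 2**attempt (plus jitter) and retry."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RateLimited:
            if attempt == max_retries:
                raise  # out of retries: surface the error to the caller
            delay = min(max_delay, base_delay * (2 ** attempt))
            # Up to 10% jitter keeps parallel workers from retrying in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
```

When using the OpenAI SDK, the exception to catch in place of `RateLimited` is `openai.RateLimitError`. If the 429 response carries a `Retry-After` header, honoring it is usually better than a computed delay.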
404 — model not found
Use exact model ID strings: llama-3.3-70b-versatile, llama-3.1-8b-instant, mixtral-8x7b-32768, gemma2-9b-it. Model IDs are versioned and change. Get current list from console.groq.com/docs/models.
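Because short or outdated names fail with 404, it can help to validate the model ID against the current list before sending a request. The sketch below assumes you already have that list as plain strings; `resolve_model` is a hypothetical helper, and the suggestions are simple fuzzy string matches, not anything provided by Groq.

```python
import difflib


def resolve_model(requested, available):
    """Return requested if it is an exact ID; otherwise raise with close suggestions."""
    if requested in available:
        return requested
    # Fuzzy matching only produces hints; the API itself requires the exact string.
    hints = difflib.get_close_matches(requested, available, n=3, cutoff=0.3)
    raise ValueError(f"Unknown model {requested!r}; did you mean one of {hints}?")
```

With the OpenAI SDK pointed at Groq's base URL, `client.models.list()` is one way to obtain the `available` IDs at startup.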
Context length exceeded
LLaMA 3.x 8B and 70B models = 128K tokens. Mixtral 8x7B = 32K tokens. Gemma2 9B = 8K tokens. Trim your messages or implement a sliding-window context. Count tokens before sending with the model's own tokenizer; tiktoken uses OpenAI vocabularies, so treat its counts for these models as estimates only.
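A sliding-window trim might look like the following sketch. It assumes OpenAI-style message dicts with `role` and `content` keys, and the default token counter is a rough 4-characters-per-token heuristic, an assumption for illustration rather than a real tokenizer. Pass your own `count` function for accurate numbers.

```python
def estimate_tokens(text):
    """Rough heuristic: about 4 characters per token. An assumption, not a tokenizer."""
    return max(1, len(text) // 4)


def trim_to_window(messages, max_tokens, count=estimate_tokens):
    """Keep system messages plus the most recent turns that fit within max_tokens."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    budget = max_tokens - sum(count(m["content"]) for m in system)
    kept = []
    for msg in reversed(rest):  # walk from newest to oldest
        cost = count(msg["content"])
        if cost > budget:
            break  # this message and everything older is dropped
        kept.append(msg)
        budget -= cost
    return system + list(reversed(kept))
```

Set `max_tokens` below the model's context size so there is room left for the completion itself.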
Streaming not working
Python: set stream=True. JS/TS: stream: true. Read the full SSE stream until the [DONE] marker. If stream stops early: check network timeout (set 60s+). Some proxies buffer SSE — switch to a direct connection or disable buffering.
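If you read the stream yourself instead of through an SDK, the loop has to skip keep-alive lines and stop at `[DONE]`. The sketch below assumes the OpenAI-compatible chunk shape (`choices[].delta.content`); `iter_stream_text` is a hypothetical helper that takes any iterable of lines, for example the line iterator of an HTTP response.

```python
import json


def iter_stream_text(lines):
    """Yield content deltas from OpenAI-style SSE lines until the [DONE] marker."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and SSE comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            return  # normal end of stream
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:  # role-only and empty deltas carry no text
                yield text
```

If the generator ends without ever seeing `[DONE]`, the connection was cut early, which points at a timeout or a buffering proxy rather than the API.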
Slow response / not fast
Groq LPU is fastest for single-turn completions. Latency increases with context length and complex multi-turn conversations. Cold start: first request after idle may take 1-2s. Batch requests hit rate limits faster — prefer sequential with backoff.
Groq API quick reference
curl (OpenAI-compatible)
curl -X POST "https://api.groq.com/openai/v1/chat/completions" \
  -H "Authorization: Bearer gsk_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.3-70b-versatile",
    "messages": [
      {"role": "user", "content": "Explain quantum computing in one sentence"}
    ],
    "temperature": 0.7,
    "max_tokens": 256
  }'
Python (Groq SDK)
from groq import Groq

client = Groq(api_key="gsk_YOUR_KEY")
chat = client.chat.completions.create(
    model="llama-3.1-8b-instant",
    messages=[{"role": "user", "content": "Hello, fast world!"}],
)
print(chat.choices[0].message.content)
Python (OpenAI SDK, drop-in replacement)
from openai import OpenAI

client = OpenAI(
    api_key="gsk_YOUR_KEY",
    base_url="https://api.groq.com/openai/v1",
)
chat = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Hello!"}],
)
Groq free tier rate limits by model
| Model | Model ID | TPM | RPM | Context |
|---|---|---|---|---|
| LLaMA 3.3 70B | llama-3.3-70b-versatile | 6,000 | 30 | 128K |
| LLaMA 3.1 8B | llama-3.1-8b-instant | 30,000 | 30 | 128K |
| LLaMA 3.1 70B | llama-3.1-70b-versatile | 6,000 | 30 | 128K |
| Mixtral 8x7B | mixtral-8x7b-32768 | 5,000 | 30 | 32K |
| Gemma2 9B | gemma2-9b-it | 15,000 | 30 | 8K |
TPM = tokens per minute. RPM = requests per minute. Add billing at console.groq.com to unlock pay-as-you-go higher limits. Check console.groq.com/docs/models for the latest list.
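Both caps apply at once, so the sustainable request rate is whichever limit is hit first. A small worked calculation, using the table's figures and an assumed workload of 1,000 tokens per request (prompt plus completion):

```python
def max_requests_per_minute(tpm_limit, rpm_limit, tokens_per_request):
    """Requests per minute sustainable when both the TPM and RPM caps apply."""
    if tokens_per_request <= 0:
        raise ValueError("tokens_per_request must be positive")
    return min(rpm_limit, tpm_limit // tokens_per_request)


# 1,000 tokens per request is an assumed workload, not a Groq figure.
print(max_requests_per_minute(6_000, 30, 1_000))   # llama-3.3-70b-versatile -> 6
print(max_requests_per_minute(30_000, 30, 1_000))  # llama-3.1-8b-instant -> 30
```

At this request size the 70B model is token-bound (6 requests per minute) while the 8B model is request-bound (30 per minute), which is why switching models relieves 429s for token-heavy workloads.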
Step-by-step fix
1. Check live Groq status
Visit prismix.dev/service/groq and groqstatus.com. If the API component is operational, the issue is authentication or rate limits.
2. Fix API 401 authentication
Your API key must start with gsk_. Generate at console.groq.com/keys. Use header: Authorization: Bearer gsk_YOUR_KEY. If using the OpenAI SDK, set base_url="https://api.groq.com/openai/v1" in the client constructor.
3. Fix 429 rate limit errors
Free tier LLaMA 3.3 70B: 6,000 tokens/min. For lighter tasks: switch to llama-3.1-8b-instant (30,000 TPM, a 5× higher limit). Implement exponential backoff on 429 responses. To increase limits permanently: add a payment method at console.groq.com/settings/billing.
4. Fix model not found (404)
Use exact model IDs: llama-3.3-70b-versatile, llama-3.1-8b-instant, mixtral-8x7b-32768. Short names like "llama3" or "mistral" return 404. Get the current full list: GET https://api.groq.com/openai/v1/models.
5. Fix context length errors
Trim your messages to fit: LLaMA 3 = 128K tokens max, Mixtral = 32K tokens, Gemma2 = 8K tokens. Count tokens before sending. For long documents: use a sliding-window approach or summarize older turns. Truncation from the start of the conversation (keep system prompt + recent messages) preserves coherence.
Frequently asked questions
Why is Groq API not working?
Groq API issues: (1) 401 (key must start with gsk_, header: Authorization: Bearer gsk_KEY); (2) 429 rate limit (free: LLaMA 70B 6k TPM, use LLaMA 8B at 30k TPM instead); (3) 404 model not found (use exact ID like llama-3.3-70b-versatile); (4) context too long (LLaMA 3 max 128K, Mixtral max 32K); (5) outage (check prismix.dev/service/groq).
Is Groq down right now?
Check prismix.dev/service/groq for live Groq Cloud status. Also see groqstatus.com for official component-level incident reports.
What are Groq free tier rate limits?
Groq free tier per model per minute: LLaMA 3.3 70B Versatile: 6,000 TPM, 30 RPM. LLaMA 3.1 8B Instant: 30,000 TPM, 30 RPM. Mixtral 8x7B: 5,000 TPM, 30 RPM. Gemma2 9B: 15,000 TPM, 30 RPM. Add a payment method at console.groq.com for higher pay-as-you-go limits.
How to use Groq with the OpenAI Python SDK?
Groq is OpenAI-compatible. Python: from openai import OpenAI; client = OpenAI(api_key="gsk_YOUR_KEY", base_url="https://api.groq.com/openai/v1"). Then use client.chat.completions.create() with Groq model IDs (llama-3.3-70b-versatile) instead of OpenAI model names (gpt-4). The rest of the API is identical.
Groq vs OpenAI vs Together AI — which is faster?
Groq's custom LPU hardware makes it one of the fastest options for LLM inference: roughly 750-1500 tokens/second for LLaMA 3.1 8B, versus roughly 50-80 tokens/second for OpenAI GPT-4o. Together AI and Fireworks offer comparable speed on smaller models. Groq tradeoffs: smaller model selection, stricter free rate limits, and no GPT-4/Claude support. For latency-critical real-time apps built on open-source models, Groq is a strong choice.