Groq API Not Working?

Groq API returning 401 (gsk_ key format), 429 rate limit, model not found, context length exceeded, or streaming failing? Check live status and fix it fast.

Groq live status

Updated every 5 minutes. Full history at prismix.dev/service/groq.

What's wrong? Diagnose fast

API 401 — invalid API key

Keys start with gsk_. Header: Authorization: Bearer gsk_YOUR_KEY. Generate at console.groq.com/keys. Using OpenAI SDK? Set base_url="https://api.groq.com/openai/v1" and api_key="gsk_YOUR_KEY".
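A quick pre-flight check can rule out a malformed key before any request is sent. The sketch below only verifies the gsk_ prefix described above; it cannot tell whether the key is actually valid or revoked. The helper name build_auth_headers is hypothetical, and in practice the key would usually come from an environment variable rather than source code.

```python
def build_auth_headers(api_key: str) -> dict:
    """Return request headers for the Groq API, or raise if the key looks malformed."""
    # Strip whitespace so a trailing newline from a copied key does not end up in the header.
    key = api_key.strip()
    if not key.startswith("gsk_"):
        raise ValueError("Groq API keys start with 'gsk_'; generate one at console.groq.com/keys")
    return {
        "Authorization": f"Bearer {key}",
        "Content-Type": "application/json",
    }


# Example with a placeholder key (not a real credential).
headers = build_auth_headers("gsk_YOUR_KEY\n")
print(headers["Authorization"])  # Bearer gsk_YOUR_KEY
```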

429 — rate limit exceeded

Free tier: LLaMA 3.3 70B = 6k TPM, LLaMA 3.1 8B = 30k TPM, Mixtral = 5k TPM. Fix: switch to LLaMA 3.1 8B for lighter tasks that do not need the 70B model (5× higher limit). Add billing info at console.groq.com to increase limits. Retry 429 responses with exponential backoff.
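Exponential backoff can be sketched in a few lines. The version below is a minimal illustration, not Groq's recommended implementation: RateLimited is a hypothetical exception that your own request function would raise on an HTTP 429, and the delay values are arbitrary defaults.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical exception: raise this from your request function on HTTP 429."""


def call_with_backoff(send, max_retries=5, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call send() and retry on RateLimited, doubling the delay after each failure."""
    for attempt in range(max_retries + 1):
        try:
            return send()
        except RateLimited:
            if attempt == max_retries:
                raise  # out of retries: surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            # A little random jitter keeps parallel clients from retrying in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
```

Here send would wrap a single API request. With the defaults, the waits are roughly 1, 2, 4, 8, and 16 seconds before the error is re-raised.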

404 — model not found

Use exact model ID strings: llama-3.3-70b-versatile, llama-3.1-8b-instant, mixtral-8x7b-32768, gemma2-9b-it. Model IDs are versioned and change over time, so an ID that worked last month may now return 404. Get the current list from console.groq.com/docs/models.
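One way to catch a wrong model ID early is to compare it against the list the API reports. The sketch below assumes the OpenAI SDK is installed and pointed at Groq's OpenAI-compatible base URL; fetch_model_ids and suggest_model_ids are hypothetical helper names, and the matching rule (ignore punctuation, then look for a substring) is just one simple heuristic.

```python
def fetch_model_ids(api_key: str) -> list:
    """List the model IDs currently visible to this key (makes a network call)."""
    from openai import OpenAI  # imported lazily so the helper below works without the SDK

    client = OpenAI(api_key=api_key, base_url="https://api.groq.com/openai/v1")
    return sorted(model.id for model in client.models.list())


def _squash(text: str) -> str:
    # Lowercase and drop punctuation so "llama3" can match "llama-3.1-8b-instant".
    return "".join(ch for ch in text.lower() if ch.isalnum())


def suggest_model_ids(requested: str, available: list) -> list:
    """Return [requested] if it is an exact ID, otherwise IDs that contain the short name."""
    if requested in available:
        return [requested]
    needle = _squash(requested)
    return [m for m in available if needle and needle in _squash(m)]
```

For example, suggest_model_ids("llama3", fetch_model_ids(key)) would return the full LLaMA IDs to choose from, while an empty list means nothing similar is currently offered.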

Context length exceeded

LLaMA 3.1 8B and LLaMA 3.3 70B = 128K tokens. Mixtral 8x7B = 32K tokens. Gemma2 9B = 8K tokens. Trim your messages or implement a sliding-window context. Count tokens before sending with tiktoken (only an approximation for non-OpenAI models) or the model's own tokenizer.
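A sliding-window trim might look like the sketch below. It keeps every system message and then adds turns from newest to oldest until the budget is used up. The token estimate is a deliberately crude assumption (about four characters per token); pass a real tokenizer as count for accurate results. The function names are illustrative only.

```python
def estimate_tokens(text: str) -> int:
    # Crude assumption: about 4 characters per token. Use a real tokenizer for accuracy.
    return max(1, len(text) // 4)


def trim_to_window(messages, max_tokens, count=estimate_tokens):
    """Keep system messages plus the most recent turns that fit within max_tokens."""
    system = [m for m in messages if m["role"] == "system"]
    others = [m for m in messages if m["role"] != "system"]
    budget = max_tokens - sum(count(m["content"]) for m in system)
    kept = []
    for message in reversed(others):  # walk from newest to oldest
        cost = count(message["content"])
        if cost > budget:
            break  # stop at the first turn that does not fit, so history stays contiguous
        kept.append(message)
        budget -= cost
    return system + list(reversed(kept))
```

Leave headroom below the model's limit for the completion itself, since max_tokens for the reply also has to fit in the context window.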

Streaming not working

Python: set stream=True. JS/TS: stream: true. Read the full SSE stream until the [DONE] marker. If the stream stops early, check your network timeout (set 60s or more). Some proxies buffer SSE; switch to a direct connection or disable buffering.
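When reading the stream over raw HTTP instead of through an SDK, the SSE lines have to be parsed by hand. The sketch below assumes the OpenAI-compatible chunk shape, where each data: line carries JSON with the text in choices[0].delta.content; the function name is hypothetical. It accepts any iterable of lines, for example the line iterator of an HTTP response.

```python
import json


def iter_stream_text(lines):
    """Yield text fragments from SSE lines, stopping at the [DONE] marker."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            return
        chunk = json.loads(data)
        choices = chunk.get("choices") or []
        if not choices:
            continue
        text = (choices[0].get("delta") or {}).get("content")
        if text:
            yield text
```

If the generator ends without ever seeing [DONE], the connection was most likely cut by a timeout or a buffering proxy rather than finished by the server.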

Slow response / not fast

Groq's LPU hardware is fastest on short, single-turn completions. Latency increases with context length and with long multi-turn conversations. Cold start: the first request after an idle period may take 1-2s. Bursts of parallel requests hit rate limits sooner; prefer sequential requests with backoff.
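Sequential sending can be paced on the client so that requests never start faster than the RPM limit allows. The sketch below is one simple way to do that, assuming each task is a zero-argument callable that performs one request; run_sequentially is a hypothetical helper, and the sleep and clock parameters exist only to make it easy to test.

```python
import time


def run_sequentially(tasks, rpm_limit, sleep=time.sleep, clock=time.monotonic):
    """Run callables one at a time, spacing starts at least 60/rpm_limit seconds apart."""
    min_interval = 60.0 / rpm_limit
    results = []
    last_start = None
    for task in tasks:
        if last_start is not None:
            wait = min_interval - (clock() - last_start)
            if wait > 0:
                sleep(wait)  # wait out the rest of the interval before the next request
        last_start = clock()
        results.append(task())
    return results
```

At the free tier's 30 RPM this spaces requests two seconds apart. It does not account for the TPM limit, so large prompts can still trigger a 429.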

Groq API quick reference

curl (OpenAI-compatible)

curl -X POST "https://api.groq.com/openai/v1/chat/completions" \
  -H "Authorization: Bearer gsk_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.3-70b-versatile",
    "messages": [
      {"role": "user", "content": "Explain quantum computing in one sentence"}
    ],
    "temperature": 0.7,
    "max_tokens": 256
  }'

Python (Groq SDK)

from groq import Groq

client = Groq(api_key="gsk_YOUR_KEY")

chat = client.chat.completions.create(
    model="llama-3.1-8b-instant",
    messages=[{"role": "user", "content": "Hello, fast world!"}],
)
print(chat.choices[0].message.content)

Python (OpenAI SDK — drop-in replacement)

from openai import OpenAI

client = OpenAI(
    api_key="gsk_YOUR_KEY",
    base_url="https://api.groq.com/openai/v1",
)

chat = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Hello!"}],
)
# Print the reply, as in the Groq SDK example above
print(chat.choices[0].message.content)

Groq free tier rate limits by model

Model | Model ID | TPM | RPM | Context
LLaMA 3.3 70B | llama-3.3-70b-versatile | 6,000 | 30 | 128K
LLaMA 3.1 8B | llama-3.1-8b-instant | 30,000 | 30 | 128K
LLaMA 3.1 70B | llama-3.1-70b-versatile | 6,000 | 30 | 128K
Mixtral 8x7B | mixtral-8x7b-32768 | 5,000 | 30 | 32K
Gemma2 9B | gemma2-9b-it | 15,000 | 30 | 8K

TPM = tokens per minute. RPM = requests per minute. Add billing at console.groq.com to unlock pay-as-you-go higher limits. Check console.groq.com/docs/models for the latest list.
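The two limits interact: whichever is reached first caps your throughput. The small calculation below makes that concrete, under the assumption that both prompt and completion tokens count toward TPM. The function name is illustrative.

```python
def max_requests_per_minute(tpm_limit: int, rpm_limit: int, tokens_per_request: int) -> int:
    """Requests per minute that fit under both the TPM and the RPM limit."""
    if tokens_per_request <= 0:
        raise ValueError("tokens_per_request must be positive")
    return min(rpm_limit, tpm_limit // tokens_per_request)


# Using the free-tier numbers from the table and 1,500 tokens per request:
print(max_requests_per_minute(6_000, 30, 1_500))   # 4  (LLaMA 3.3 70B, TPM-bound)
print(max_requests_per_minute(30_000, 30, 1_500))  # 20 (LLaMA 3.1 8B, TPM-bound)
print(max_requests_per_minute(30_000, 30, 500))    # 30 (LLaMA 3.1 8B, RPM-bound)
```

With long prompts, the TPM limit is usually reached well before the 30 RPM limit.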

Step-by-step fix

  1. Check live Groq status

    Visit prismix.dev/service/groq and groqstatus.com. If the API component is operational, the problem is most likely on your side: authentication, rate limits, the model ID, or context length.

  2. Fix API 401 authentication

    Your API key must start with gsk_. Generate at console.groq.com/keys. Use header: Authorization: Bearer gsk_YOUR_KEY. If using the OpenAI SDK, set base_url="https://api.groq.com/openai/v1" in the client constructor.

  3. Fix 429 rate limit errors

    Free tier LLaMA 3.3 70B: 6,000 tokens/min. For lighter tasks: switch to llama-3.1-8b-instant (30,000 TPM — 5× higher limit). Implement exponential backoff on 429 responses. To increase limits permanently: add a payment method at console.groq.com/settings/billing.

  4. Fix model not found (404)

    Use exact model IDs: llama-3.3-70b-versatile, llama-3.1-8b-instant, mixtral-8x7b-32768. Short names like "llama3" or "mistral" return 404. Get the current full list: GET https://api.groq.com/openai/v1/models.

  5. Fix context length errors

    Trim your messages to fit: LLaMA 3.1 and 3.3 = 128K tokens max, Mixtral = 32K tokens, Gemma2 = 8K tokens. Count tokens before sending. For long documents, use a sliding-window approach or summarize older turns. When truncating, drop the oldest turns first and keep the system prompt plus the most recent messages; this preserves coherence.
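The checklist above can be condensed into a small triage helper for error handlers and logs. This is only a sketch: the status codes for steps 2 to 4 come from this page, while detecting context-length errors by looking for "context" or "too long" in the error message is an assumption about the wording, so adjust it to the messages you actually see.

```python
def diagnose(status_code: int, error_message: str = "") -> str:
    """Map a failed response to the matching step in the checklist above."""
    message = error_message.lower()
    if status_code == 401:
        return "Step 2: check that the key starts with gsk_ and is sent as a Bearer token."
    if status_code == 429:
        return "Step 3: back off and retry, or switch to a model with a higher TPM limit."
    if status_code == 404:
        return "Step 4: use an exact model ID from the /models endpoint."
    # Assumption: context-length errors mention "context" or "too long" in the message.
    if "context" in message or "too long" in message:
        return "Step 5: trim or summarize older messages to fit the context window."
    if status_code >= 500:
        return "Step 1: check the live status pages for an outage."
    return "Unrecognized error: read the response body for details."
```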

Frequently asked questions

Why is Groq API not working?

Common causes: (1) 401 authentication error (the key must start with gsk_ and be sent as Authorization: Bearer gsk_KEY); (2) 429 rate limit (free tier: LLaMA 3.3 70B allows 6k TPM, so use LLaMA 3.1 8B at 30k TPM where possible); (3) 404 model not found (use an exact ID such as llama-3.3-70b-versatile); (4) context too long (LLaMA 3.1 and 3.3 max 128K tokens, Mixtral max 32K); (5) a Groq outage (check prismix.dev/service/groq).

Is Groq down right now?

Check prismix.dev/service/groq for live Groq Cloud status. Also see groqstatus.com for official component-level incident reports.

What are Groq free tier rate limits?

Groq free tier per model per minute: LLaMA 3.3 70B Versatile: 6,000 TPM, 30 RPM. LLaMA 3.1 8B Instant: 30,000 TPM, 30 RPM. Mixtral 8x7B: 5,000 TPM, 30 RPM. Gemma2 9B: 15,000 TPM, 30 RPM. Add a payment method at console.groq.com for higher pay-as-you-go limits.

How do I use Groq with the OpenAI Python SDK?

Groq exposes an OpenAI-compatible endpoint. Python: from openai import OpenAI; client = OpenAI(api_key="gsk_YOUR_KEY", base_url="https://api.groq.com/openai/v1"). Then call client.chat.completions.create() with Groq model IDs (llama-3.3-70b-versatile) instead of OpenAI model names (gpt-4). The chat completions call is otherwise used the same way.

Groq vs OpenAI vs Together AI — which is faster?

Groq is generally the fastest of the three for the open-source models it serves, thanks to its custom LPU hardware: LLaMA 3.1 8B runs at 750-1500 tokens/second, versus 50-80 tokens/second for OpenAI GPT-4o. Together AI and Fireworks offer similar speed on smaller models. Groq's tradeoffs: a smaller model selection, stricter free-tier rate limits, and no GPT-4 or Claude support. For latency-critical real-time apps built on open-source models, Groq is hard to beat.
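To see what those throughput figures mean for a user waiting on a reply, the arithmetic below converts tokens per second into generation time for a 500-token answer. It uses the lower bounds quoted above and ignores network latency and time to first token, so treat the results as rough estimates.

```python
def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Time to generate output_tokens at a steady tokens_per_second rate."""
    return output_tokens / tokens_per_second


print(f"{generation_seconds(500, 750):.2f}")  # 0.67 seconds at 750 tokens/second
print(f"{generation_seconds(500, 50):.2f}")   # 10.00 seconds at 50 tokens/second
```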
