Why is the Groq API not working?

Common causes: (1) 401: the API key is invalid (create one at console.groq.com); (2) 429: rate limit exceeded; the free tier has strict limits (30 RPM, 14,400 RPD); (3) wrong model name: Groq uses specific version strings such as llama-3.3-70b-versatile, not llama-3-70b; (4) context length exceeded: Groq's context windows vary by model; (5) a Groq infrastructure issue: check prismix.dev/service/groq-api.

Groq API 401 — authentication error?

(1) Get API key at console.groq.com → API Keys → Create API Key; (2) set as env var GROQ_API_KEY; (3) use Authorization: Bearer header or pass via SDK; (4) Groq API keys start with gsk_ — if yours doesn't, it may be a legacy key format; (5) Groq's playground key is different from API keys — always use keys from the API Keys section.

Groq API rate limit 429?

Free tier limits per model: 30 requests/minute, 14,400 requests/day, 6,000 tokens/minute. Limits are tracked per model, so requests to one model do not count against another model's quota. To resolve: (1) add retry with exponential backoff; (2) rotate between multiple models (for example llama-3.3-70b-versatile and llama-3.1-8b-instant) to spread load across separate quotas; (3) upgrade to a paid plan for higher limits; (4) queue requests instead of sending them in parallel.

Groq API model not found?

Groq deprecates old model versions frequently. Current models (check console.groq.com/docs/models for latest): llama-3.3-70b-versatile (recommended), llama-3.1-8b-instant (fastest, cheapest), llama3-8b-8192 (older, being deprecated), mixtral-8x7b-32768 (high quality), gemma2-9b-it (Google Gemma). Do NOT use: llama3-70b-8192 (deprecated in favor of llama-3.3-70b-versatile).

Groq streaming not working?

Groq uses the OpenAI-compatible API for streaming. Set stream=True (Python) or stream: true (REST). The response is a Server-Sent Events stream — each data: line contains a JSON chunk. With the Groq SDK, use client.chat.completions.create(stream=True) and iterate the response. Do NOT call .content on a streaming response — iterate chunks.

Groq API Not Working? Fix Auth, Rate Limits & Model Errors

Troubleshoot Groq API errors — 401 invalid API key, 429 rate limit exceeded (free tier 30 RPM), model not found, streaming issues, and context length exceeded on Llama and Mixtral models.

Common errors and fixes

Authentication — 401 error

A 401 error means the API key is missing, invalid, or passed incorrectly. Use the Groq SDK or the OpenAI SDK pointed at Groq's base URL:

# Python — using groq SDK
from groq import Groq
import os

client = Groq(api_key=os.environ["GROQ_API_KEY"])

response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Hello"}]
)
print(response.choices[0].message.content)

# Or using the OpenAI SDK with Groq's base URL
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["GROQ_API_KEY"],
    base_url="https://api.groq.com/openai/v1"
)

Get an API key at console.groq.com: go to API Keys → Create API Key. Keys start with gsk_.
Store it in the GROQ_API_KEY environment variable: the SDK examples above read it from there. For raw HTTP requests, send it in an Authorization: Bearer header.
Legacy key format: if your key doesn't start with gsk_, it may be an old format; generate a new key.
Playground key is separate: Groq's playground uses its own session token, not an API key. Always use keys from the API Keys section.
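
The checks in this list can be run before any request is sent. The following is a minimal sketch, not Groq's own validation: `load_groq_key` and `auth_headers` are hypothetical helper names, and the `gsk_` prefix test only reflects the key format described above.

```python
import os


def load_groq_key(env=None):
    """Read GROQ_API_KEY and apply basic sanity checks before any request is sent."""
    env = os.environ if env is None else env
    key = (env.get("GROQ_API_KEY") or "").strip()
    if not key:
        raise ValueError("GROQ_API_KEY is not set")
    if not key.startswith("gsk_"):
        # Assumption taken from this guide: current keys start with gsk_.
        raise ValueError("key does not start with gsk_; generate a new one")
    return key


def auth_headers(key):
    """Build the headers for a raw HTTP call made without an SDK."""
    return {
        "Authorization": f"Bearer {key}",
        "Content-Type": "application/json",
    }
```

Stripping whitespace is a deliberate choice here: a key pasted with a trailing newline is one plausible way to end up with a 401 despite a valid key.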

Rate limits — free tier

Free tier limits apply per model, not globally, so requests to different models draw on separate quotas:

Limit	Value
Requests/minute	30
Requests/day	14,400
Tokens/minute	6,000
Tokens/day	500,000
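
Whichever of these limits is reached first is the one that matters in practice. A rough way to see which applies, assuming every request uses about the same number of tokens (the function name and the example token counts are illustrative; the limits are the ones in the table above):

```python
# Free tier limits as listed in the table above (per model).
FREE_TIER = {"rpm": 30, "rpd": 14_400, "tpm": 6_000, "tpd": 500_000}


def effective_limits(tokens_per_request, limits=FREE_TIER):
    """Return the sustainable (requests per minute, requests per day) for one model."""
    per_minute = min(limits["rpm"], limits["tpm"] // tokens_per_request)
    per_day = min(limits["rpd"], limits["tpd"] // tokens_per_request)
    return per_minute, per_day


print(effective_limits(100))    # (30, 5000)
print(effective_limits(1_000))  # (6, 500)
```

With requests of around 1,000 tokens, the token limits become the binding constraint long before the 30 requests/minute cap does.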

Strategy: rotate between models to effectively multiply your rate limit:

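import itertools
import time

MODELS = ["llama-3.3-70b-versatile", "llama-3.1-8b-instant", "mixtral-8x7b-32768"]
model_cycle = itertools.cycle(MODELS)

def groq_call(client, prompt, max_attempts=5):
    """Send the prompt to the next model in the rotation, backing off on 429."""
    for attempt in range(max_attempts):
        try:
            return client.chat.completions.create(
                model=next(model_cycle),
                messages=[{"role": "user", "content": prompt}]
            )
        except Exception as e:
            # HTTP errors raised by the SDK carry a status_code attribute;
            # anything without one (or with another code) is re-raised.
            rate_limited = getattr(e, "status_code", None) == 429
            if rate_limited and attempt < max_attempts - 1:
                time.sleep((2 ** attempt) + 0.5)  # 1.5s, 2.5s, 4.5s, 8.5s
            else:
                raise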

Upgrade to a paid plan at console.groq.com for higher limits.
Use request queuing rather than parallel requests to stay within the per-minute cap.
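
One simple form of request queuing is to pace calls on the client so they never start faster than the per-minute cap allows. The sketch below is one possible approach rather than anything Groq prescribes; `RequestPacer` is a hypothetical name, and the `clock` and `sleep` parameters exist only so the class can be tested without real waiting.

```python
import time


class RequestPacer:
    """Space out calls evenly so at most `rpm` of them start per 60 seconds."""

    def __init__(self, rpm=30, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / rpm  # 2 seconds at 30 requests/minute
        self._clock = clock
        self._sleep = sleep
        self._next_slot = None  # earliest time the next call may start

    def wait(self):
        """Block until the next slot is free, then reserve the following one."""
        now = self._clock()
        if self._next_slot is not None and now < self._next_slot:
            self._sleep(self._next_slot - now)
            now = self._next_slot
        self._next_slot = now + self.min_interval


# Usage (not run here): call wait() before each request.
# pacer = RequestPacer(rpm=30)
# for prompt in prompts:
#     pacer.wait()
#     send_request(prompt)  # hypothetical function that makes the API call
```

Even spacing gives up burst capacity in exchange for predictability. It only covers the request count, so long prompts can still hit the tokens/minute limit.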

Current model names

Groq deprecates old model versions frequently. Always verify at console.groq.com/docs/models. As of June 2026:

Model ID	Notes
`llama-3.3-70b-versatile`	Best quality, recommended
`llama-3.1-8b-instant`	Fastest, lowest cost, 128K context
`llama3-8b-8192`	Older 8K context version (use llama-3.1-8b-instant instead)
`mixtral-8x7b-32768`	32K context, good quality
`gemma2-9b-it`	Google Gemma, fast
`llama-3.2-11b-vision-preview`	Vision/image input support
`whisper-large-v3`	Speech-to-text (audio API, not chat)

Deprecated (will return 404): llama2-70b-4096, llama3-70b-8192 — use llama-3.3-70b-versatile instead.
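
If older model IDs are scattered through a codebase, a small lookup can redirect them in one place. This is a sketch under the assumption that the replacements named in this guide are still current; `REPLACEMENTS` and `resolve_model` are hypothetical names, and `available` stands for a list of model IDs you have fetched yourself from Groq's models listing.

```python
# Replacements taken from this guide; verify against
# console.groq.com/docs/models before relying on them.
REPLACEMENTS = {
    "llama2-70b-4096": "llama-3.3-70b-versatile",
    "llama3-70b-8192": "llama-3.3-70b-versatile",
    "llama3-8b-8192": "llama-3.1-8b-instant",
}


def resolve_model(requested, available=None):
    """Swap an old model ID for its replacement and optionally check availability."""
    model = REPLACEMENTS.get(requested, requested)
    if available is not None and model not in available:
        raise ValueError(f"model {model!r} is not in the available list")
    return model


print(resolve_model("llama3-70b-8192"))  # llama-3.3-70b-versatile
```

Failing early with a clear message is usually easier to debug than a model-not-found error surfacing deep inside a request pipeline.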

Context length exceeded

Each Groq model has different context limits. If you hit the limit, switch to a model with a larger context window:

Model	Context
`llama-3.3-70b-versatile`	128K tokens
`llama-3.1-8b-instant`	128K tokens
`mixtral-8x7b-32768`	32K tokens
`llama3-8b-8192`	8K tokens (old model)

For conversation history management, trim older messages to stay within limits:

def trim_messages(messages, max_messages=40):
    # Keep every system message plus the most recent conversation messages
    system = [m for m in messages if m["role"] == "system"]
    conversation = [m for m in messages if m["role"] != "system"]
    # Simple trim: 40 messages is roughly the last 20 user/assistant exchanges
    return system + conversation[-max_messages:]
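
Counting messages does not guarantee that the prompt fits the context window, because a few very long messages can exceed it on their own. A token-budget variant can be sketched as follows. The estimate of about 4 characters per token is a rough heuristic assumed for illustration, not Groq's tokenizer, and `estimate_tokens` and `trim_to_budget` are hypothetical names.

```python
def estimate_tokens(message):
    """Very rough estimate: about 4 characters per token, plus 1 to avoid zero."""
    return len(str(message.get("content", ""))) // 4 + 1


def trim_to_budget(messages, max_tokens=100_000):
    """Keep system messages, then the newest messages that fit the budget."""
    system = [m for m in messages if m["role"] == "system"]
    conversation = [m for m in messages if m["role"] != "system"]
    budget = max_tokens - sum(estimate_tokens(m) for m in system)
    kept = []
    for message in reversed(conversation):  # walk from newest to oldest
        cost = estimate_tokens(message)
        if cost > budget:
            break
        budget -= cost
        kept.append(message)
    kept.reverse()  # restore chronological order
    return system + kept
```

Because the estimate is approximate, it is safer to set `max_tokens` well below the model's context window and to leave room for the completion.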

Streaming not working

Groq uses the OpenAI-compatible API for streaming. Iterate chunks — do NOT call .content on the streaming response object directly:

# Python streaming
response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Write a short story"}],
    stream=True
)

for chunk in response:
    content = chunk.choices[0].delta.content
    if content:
        print(content, end="", flush=True)

// JavaScript streaming (groq-sdk)
import Groq from "groq-sdk";

const client = new Groq({ apiKey: process.env.GROQ_API_KEY });

const stream = await client.chat.completions.create({
  model: "llama-3.3-70b-versatile",
  messages: [{ role: "user", content: "Write a story" }],
  stream: true,
});

for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content || "");
}

Set stream=True (Python) or stream: true (REST/JS) in the request.
REST response is Server-Sent Events — each data: line contains a JSON chunk.
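
When calling the REST endpoint without an SDK, the `data:` lines have to be parsed by hand. The sketch below assumes the stream follows the OpenAI convention, including a final `data: [DONE]` line; `text_from_sse` is a hypothetical helper, and it expects lines that have already been decoded to strings by whatever HTTP client is in use.

```python
import json


def text_from_sse(lines):
    """Yield the text pieces carried by an iterable of decoded SSE lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and any non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":  # assumed OpenAI-style end-of-stream marker
            break
        chunk = json.loads(payload)
        choices = chunk.get("choices") or []
        if not choices:
            continue
        piece = choices[0].get("delta", {}).get("content")
        if piece:  # the first chunk often carries only the role
            yield piece


sample = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    "",
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    "",
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    "",
    "data: [DONE]",
]
print("".join(text_from_sse(sample)))  # Hello
```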

FAQ

Why is Groq so fast?

Groq runs on custom Language Processing Units (LPUs) rather than GPUs. LPUs have deterministic execution with no memory bandwidth bottleneck, enabling 500–750 tokens/second for 70B models — 5–10x faster than typical GPU inference.

Groq vs OpenRouter vs Together AI — which for fastest inference?

Groq is fastest (LPU hardware), ideal for latency-sensitive use cases. OpenRouter gives access to more models. Together AI offers fine-tuning and dedicated deployments. For raw speed at common model sizes, Groq wins.

Does Groq support function calling / tool use?

Yes, via the OpenAI-compatible tools parameter. Supported on llama-3.3-70b-versatile, llama-3.1-8b-instant, and mixtral-8x7b-32768. Same format as OpenAI tool definitions.
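
As an illustration of that format, here is a minimal sketch of a tool definition and a dispatcher for the calls the model sends back. The `get_weather` tool, its schema, and the `run_tool_call` helper are all made up for this example; only the overall shape of the `tools` list follows the OpenAI convention.

```python
import json

# Hypothetical tool: the name, description, and schema are illustrative only.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]


def get_weather(city):
    """Stand-in implementation; a real one would call a weather service."""
    return {"city": city, "forecast": "unknown"}


LOCAL_FUNCTIONS = {"get_weather": get_weather}


def run_tool_call(name, arguments_json):
    """Run one tool call; in the OpenAI format the arguments arrive as a JSON string."""
    kwargs = json.loads(arguments_json)
    result = LOCAL_FUNCTIONS[name](**kwargs)
    return json.dumps(result)


# Passing the tools to a request (not run here):
# response = client.chat.completions.create(
#     model="llama-3.3-70b-versatile", messages=messages, tools=TOOLS)
# for call in response.choices[0].message.tool_calls or []:
#     output = run_tool_call(call.function.name, call.function.arguments)
```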
