Groq API Not Working? Fix Auth, Rate Limits & Model Errors

Troubleshoot Groq API errors — 401 invalid API key, 429 rate limit exceeded (free tier 30 RPM), model not found, streaming issues, and context length exceeded on Llama and Mixtral models.

Common errors and fixes

Authentication — 401 error

A 401 error means the API key is missing, invalid, or passed incorrectly. Use the Groq SDK or the OpenAI SDK pointed at Groq's base URL:

# Python — using groq SDK
from groq import Groq
import os

client = Groq(api_key=os.environ["GROQ_API_KEY"])

response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Hello"}]
)
print(response.choices[0].message.content)

# Or using the OpenAI SDK with Groq's base URL
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["GROQ_API_KEY"],
    base_url="https://api.groq.com/openai/v1"
)

  • Get API key at console.groq.com — go to API Keys → Create API Key. Keys start with gsk_.
  • Set as env var GROQ_API_KEY — or pass via Authorization: Bearer header.
  • Legacy key format — if your key doesn't start with gsk_, it may be an old format; generate a new key.
  • Playground key is separate — Groq's playground uses its own session token, not an API key. Always use keys from the API Keys section.
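
Most 401s come down to the key never reaching the request in the expected form. The following is a minimal pre-flight sketch, assuming the gsk_ prefix described above. `check_groq_key` and `auth_headers` are hypothetical helper names, not part of any SDK:

```python
import os


def check_groq_key(key):
    """Return a list of likely problems with an API key string (empty if none found)."""
    problems = []
    if not key:
        problems.append("GROQ_API_KEY is not set or is empty")
        return problems
    if key != key.strip():
        problems.append("key has leading or trailing whitespace (often a copy-paste or .env quoting issue)")
    if not key.strip().startswith("gsk_"):
        problems.append("key does not start with gsk_; it may be a legacy or playground token")
    return problems


def auth_headers(key):
    """Build headers for a raw REST call: the key travels as a Bearer token."""
    return {
        "Authorization": f"Bearer {key.strip()}",
        "Content-Type": "application/json",
    }


if __name__ == "__main__":
    for problem in check_groq_key(os.environ.get("GROQ_API_KEY")):
        print("warning:", problem)
```

Running the check before the first request separates local configuration mistakes from keys that the server itself rejects.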

Rate limits — free tier

Free tier limits apply per model, not globally, so requests to different models draw from separate buckets. Check console.groq.com for the exact values for each model:

Limit             Value
Requests/minute   30
Requests/day      14,400
Tokens/minute     6,000
Tokens/day        500,000
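
The request and token limits interact: whichever runs out first is the effective cap. Here is a small worked sketch using the values from the table. The average request size is an assumed input, and `effective_rpm` is a hypothetical helper:

```python
RPM_LIMIT = 30       # requests per minute, from the table above
TPM_LIMIT = 6_000    # tokens per minute, from the table above


def effective_rpm(avg_tokens_per_request):
    """Requests per minute you can sustain for a given average request size in tokens."""
    by_tokens = TPM_LIMIT // avg_tokens_per_request
    return min(RPM_LIMIT, by_tokens)


print(effective_rpm(150))    # 30: the request limit binds first
print(effective_rpm(1_000))  # 6: the token limit binds first
```

With these numbers, any workload averaging more than 200 tokens per request hits the token limit before the request limit.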

Strategy: rotate between models to effectively multiply your rate limit:

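import itertools
import time

MODELS = ["llama-3.3-70b-versatile", "llama-3.1-8b-instant", "mixtral-8x7b-32768"]
model_cycle = itertools.cycle(MODELS)

def groq_call(client, prompt, max_attempts=5):
    for attempt in range(max_attempts):
        try:
            # Each attempt uses the next model, so a retry lands in a different bucket
            return client.chat.completions.create(
                model=next(model_cycle),
                messages=[{"role": "user", "content": prompt}]
            )
        except Exception as e:
            # The groq and openai SDKs expose the HTTP status on API errors
            is_rate_limit = getattr(e, "status_code", None) == 429
            if is_rate_limit and attempt < max_attempts - 1:
                time.sleep((2 ** attempt) + 0.5)  # backoff: 1.5s, 2.5s, 4.5s, 8.5s
            else:
                raise
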
  • Upgrade to paid plan for higher limits at console.groq.com.
  • Use request queuing rather than parallel requests to stay within the per-minute cap.
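
Request queuing can be as simple as spacing calls evenly. The sketch below assumes a single-threaded caller and the 30 requests/minute figure. `MinIntervalLimiter` and `queued_call` are hypothetical names, and the clock and sleep functions are injectable only to make the class easy to test:

```python
import time


class MinIntervalLimiter:
    """Space calls at least 60 / per_minute seconds apart (not thread-safe)."""

    def __init__(self, per_minute=30, clock=time.monotonic, sleep=time.sleep):
        self.interval = 60.0 / per_minute
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = None  # earliest time the next call may start

    def wait(self):
        """Block until the next call is allowed, then reserve the following slot."""
        now = self.clock()
        if self.next_allowed is not None and now < self.next_allowed:
            self.sleep(self.next_allowed - now)
            now = self.next_allowed
        self.next_allowed = now + self.interval


limiter = MinIntervalLimiter(per_minute=30)


def queued_call(client, model, prompt):
    limiter.wait()  # at 30 RPM this allows one request every 2 seconds
    return client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
```

Even spacing gives up burst capacity in exchange for never tripping the per-minute cap. Since limits are per model, one limiter per model would be the natural extension.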

Current model names

Groq deprecates old model versions frequently. Always verify at console.groq.com/docs/models. As of June 2026:

Model ID                       Notes
llama-3.3-70b-versatile        Best quality, recommended
llama-3.1-8b-instant           Fastest, lowest cost, 128K context
llama3-8b-8192                 Older 8K context version (use llama-3.1-8b-instant instead)
mixtral-8x7b-32768             32K context, good quality
gemma2-9b-it                   Google Gemma, fast
llama-3.2-11b-vision-preview   Vision/image input support
whisper-large-v3               Speech-to-text (audio API, not chat)

Deprecated (will return 404): llama2-70b-4096, llama3-70b-8192 — use llama-3.3-70b-versatile instead.
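
Because model IDs change, it can help to check the live model list at startup instead of hard-coding a single ID. This is a minimal sketch: `pick_available` is a hypothetical helper, and the commented usage assumes the SDK's `client.models.list()` call returns an object whose `.data` entries carry an `id` field:

```python
def pick_available(preferred, available_ids):
    """Return the first model ID in `preferred` that the API currently lists, or None."""
    available = set(available_ids)
    for model_id in preferred:
        if model_id in available:
            return model_id
    return None


# Usage with a live client (needs a valid key, so it is left commented out):
# available_ids = [m.id for m in client.models.list().data]
# model = pick_available(
#     ["llama-3.3-70b-versatile", "llama-3.1-8b-instant"], available_ids
# )
```

A `None` result means none of the preferred models is being served, which is a clearer failure than a 404 in the middle of a request.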

Context length exceeded

Each Groq model has different context limits. If you hit the limit, switch to a model with a larger context window:

Model                     Context
llama-3.3-70b-versatile   128K tokens
llama-3.1-8b-instant      128K tokens
mixtral-8x7b-32768        32K tokens
llama3-8b-8192            8K tokens (old model)

For conversation history management, trim older messages to stay within limits:

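def trim_messages(messages, max_messages=40):
    # Keep every system message plus the most recent conversation messages
    system = [m for m in messages if m["role"] == "system"]
    conversation = [m for m in messages if m["role"] != "system"]
    # Count-based trim: 40 messages is about 20 user/assistant exchanges.
    # Tokens are not counted, so very long messages can still exceed the limit.
    recent = conversation[-max_messages:] if max_messages > 0 else []
    return system + recent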
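
A count of messages is only a proxy for tokens. The sketch below trims against a token budget instead. It assumes string message content and a rough estimate of 4 characters per token, which is not the model's real tokenizer. `estimate_tokens` and `trim_to_budget` are hypothetical names:

```python
def estimate_tokens(message, chars_per_token=4):
    """Rough estimate only: about 4 characters per token plus a small per-message overhead."""
    return len(message.get("content") or "") // chars_per_token + 4


def trim_to_budget(messages, max_tokens=100_000):
    """Keep system messages, then add the newest conversation messages until the budget is spent."""
    system = [m for m in messages if m["role"] == "system"]
    conversation = [m for m in messages if m["role"] != "system"]
    budget = max_tokens - sum(estimate_tokens(m) for m in system)
    kept = []
    for message in reversed(conversation):  # walk from newest to oldest
        cost = estimate_tokens(message)
        if cost > budget:
            break
        kept.append(message)
        budget -= cost
    return system + kept[::-1]  # restore chronological order
```

Leave headroom below the model's context limit, since the estimate is approximate and the response also needs room.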

Streaming not working

Groq uses the OpenAI-compatible API for streaming. Iterate chunks — do NOT call .content on the streaming response object directly:

# Python streaming
response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Write a short story"}],
    stream=True
)

for chunk in response:
    if not chunk.choices:
        continue  # some OpenAI-compatible streams send chunks with no choices
    content = chunk.choices[0].delta.content
    if content:
        print(content, end="", flush=True)

// JavaScript streaming (assuming the groq-sdk package)
import Groq from "groq-sdk";

const client = new Groq({ apiKey: process.env.GROQ_API_KEY });

const stream = await client.chat.completions.create({
  model: "llama-3.3-70b-versatile",
  messages: [{ role: "user", content: "Write a story" }],
  stream: true,
});

for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content || "");
}

  • Set stream=True (Python) or stream: true (REST/JS) in the request.
  • REST response is Server-Sent Events — each data: line contains a JSON chunk.
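
If you call the REST endpoint without an SDK, you have to parse those data: lines yourself. This is a minimal sketch of that parsing, assuming the chunk shape used by OpenAI-compatible APIs and their `[DONE]` end-of-stream sentinel. `iter_sse_content` is a hypothetical helper, and the sample lines are made up for illustration:

```python
import json


def iter_sse_content(lines):
    """Yield text deltas from an iterable of decoded SSE lines."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        choices = chunk.get("choices") or []
        if choices:
            text = choices[0].get("delta", {}).get("content")
            if text:
                yield text


sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
print("".join(iter_sse_content(sample)))  # Hello
```

The same function works on a live response if you feed it the decoded lines of the HTTP body as they arrive.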
FAQ

Why is Groq so fast?

Groq runs on custom Language Processing Units (LPUs) rather than GPUs. LPUs execute deterministically and are designed to avoid the memory-bandwidth bottleneck that limits GPU inference, enabling roughly 500–750 tokens/second on 70B models, about 5–10x faster than typical GPU inference.

Groq vs OpenRouter vs Together AI — which for fastest inference?

Groq is fastest (LPU hardware), ideal for latency-sensitive use cases. OpenRouter gives access to more models. Together AI offers fine-tuning and dedicated deployments. For raw speed at common model sizes, Groq wins.

Does Groq support function calling / tool use?

Yes, via the OpenAI-compatible tools parameter. Supported on llama-3.3-70b-versatile, llama-3.1-8b-instant, and mixtral-8x7b-32768. Same format as OpenAI tool definitions.
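
As a rough illustration of that format, the sketch below defines one tool and dispatches a call to it. The `get_weather` tool, its stand-in implementation, and the `run_tool_call` helper are all hypothetical; only the shape of the tools list follows the OpenAI tool definition format:

```python
import json

# Hypothetical tool for illustration; the structure follows the OpenAI tools format.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]


def get_weather(city):
    """Stand-in implementation; a real tool would call a weather service."""
    return {"city": city, "forecast": "unknown"}


HANDLERS = {"get_weather": get_weather}


def run_tool_call(name, arguments_json):
    """Dispatch one tool call; the model sends its arguments as a JSON string."""
    handler = HANDLERS[name]
    return handler(**json.loads(arguments_json))


# With a live client, pass tools=TOOLS to client.chat.completions.create(...),
# then read response.choices[0].message.tool_calls and call
# run_tool_call(call.function.name, call.function.arguments) for each entry.
print(run_tool_call("get_weather", '{"city": "Oslo"}'))
```

Each tool result is then sent back in a follow-up message with role "tool" so the model can produce its final answer.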
