Groq API Not Working? Fix Auth, Rate Limits & Model Errors
Troubleshoot Groq API errors — 401 invalid API key, 429 rate limit exceeded (free tier 30 RPM), model not found, streaming issues, and context length exceeded on Llama and Mixtral models.
Common errors and fixes
Authentication — 401 error
A 401 error means the API key is missing, invalid, or passed incorrectly. Use the Groq SDK or the OpenAI SDK pointed at Groq's base URL:
# Python: using the groq SDK
import os
from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])
response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)

# Or using the OpenAI SDK with Groq's base URL
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["GROQ_API_KEY"],
    base_url="https://api.groq.com/openai/v1",
)

- Get an API key at console.groq.com: go to API Keys → Create API Key. Keys start with gsk_.
- Set it as the env var GROQ_API_KEY, or pass it via the Authorization: Bearer header.
- Legacy key format: if your key doesn't start with gsk_, it may be an old format; generate a new key.
- Playground key is separate: Groq's playground uses its own session token, not an API key. Always use keys from the API Keys section.
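Before sending a request, a quick local check can rule out the key problems listed above. The following is a minimal sketch, not part of the Groq SDK: `check_groq_key` and `auth_headers` are hypothetical helper names, and the only format rule assumed is the `gsk_` prefix mentioned above.

```python
def check_groq_key(key):
    """Return a list of likely problems with a Groq API key (empty if none found).

    Hypothetical helper: it only checks the points from the list above and
    cannot tell whether the key is actually accepted by Groq.
    """
    problems = []
    if not key:
        problems.append("GROQ_API_KEY is not set or is empty")
        return problems
    if key != key.strip():
        problems.append("key has leading or trailing whitespace")
    if not key.strip().startswith("gsk_"):
        problems.append("key does not start with gsk_ (old format or wrong token)")
    return problems


def auth_headers(key):
    """Build headers for a raw REST call made without an SDK."""
    return {
        "Authorization": f"Bearer {key.strip()}",
        "Content-Type": "application/json",
    }
```

Calling `check_groq_key(os.environ.get("GROQ_API_KEY"))` at startup and printing any returned problems tends to surface a missing or mis-pasted key before the first 401.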
Rate limits — free tier
Free-tier limits apply per model, not globally: each model has its own quota bucket, so requests to one model do not count against another.
| Limit | Value |
|---|---|
| Requests/minute | 30 |
| Requests/day | 14,400 |
| Tokens/minute | 6,000 |
| Tokens/day | 500,000 |
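The per-minute and per-day caps interact: running flat out at the per-minute limit uses up the daily allowance well before the day ends. A short worked calculation, using only the numbers in the table above (the variable names are illustrative):

```python
MINUTES_PER_DAY = 24 * 60  # 1,440

requests_per_minute = 30
requests_per_day = 14_400
tokens_per_minute = 6_000
tokens_per_day = 500_000

# Sustained rates that would spread each daily cap evenly over 24 hours.
sustained_rpm = requests_per_day / MINUTES_PER_DAY
sustained_tpm = tokens_per_day / MINUTES_PER_DAY

# How long each daily cap lasts when running at the per-minute cap.
hours_at_max_requests = requests_per_day / requests_per_minute / 60
minutes_at_max_tokens = tokens_per_day / tokens_per_minute

print(f"Sustained requests/minute: {sustained_rpm:.0f}")
print(f"Sustained tokens/minute:   {sustained_tpm:.0f}")
print(f"Hours until request cap at 30 RPM: {hours_at_max_requests:.0f}")
print(f"Minutes until token cap at 6,000 TPM: {minutes_at_max_tokens:.1f}")
```

With these numbers the daily token cap is the tighter one: at full token throughput it is reached in under an hour and a half, while the daily request cap lasts eight hours at 30 requests per minute.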
Strategy: rotate between models to effectively multiply your rate limit:
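import itertools
import time

MODELS = ["llama-3.3-70b-versatile", "llama-3.1-8b-instant", "mixtral-8x7b-32768"]
model_cycle = itertools.cycle(MODELS)

def groq_call(client, prompt, max_attempts=5):
    # Each attempt uses the next model in the cycle, so a 429 on one model
    # is retried against a different per-model bucket.
    for attempt in range(max_attempts):
        try:
            return client.chat.completions.create(
                model=next(model_cycle),
                messages=[{"role": "user", "content": prompt}],
            )
        except Exception as e:
            # SDK status errors expose status_code; fall back to the message text.
            rate_limited = getattr(e, "status_code", None) == 429 or "429" in str(e)
            if rate_limited and attempt < max_attempts - 1:
                time.sleep((2 ** attempt) + 0.5)  # backoff: 1.5s, 2.5s, 4.5s, 8.5s
            else:
                raise

- Upgrade to a paid plan for higher limits at console.groq.com.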
- Use request queuing rather than parallel requests to stay within the per-minute cap.
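One way to queue requests is a small client-side throttle that blocks until a slot is free in a rolling 60-second window. This is a sketch under stated assumptions rather than anything Groq provides: `MinuteThrottle` is a hypothetical class, the default of 30 calls comes from the table above, and the `clock` and `sleep` parameters exist only so the behaviour can be tested without real waiting.

```python
import time
from collections import deque


class MinuteThrottle:
    """Allow at most `max_calls` calls in any rolling `period` seconds."""

    def __init__(self, max_calls=30, period=60.0, clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self.clock = clock
        self.sleep = sleep
        self.sent = deque()  # timestamps of recent calls, oldest first

    def wait(self):
        """Block until another call is allowed, then record it."""
        now = self.clock()
        # Forget calls that have already left the rolling window.
        while self.sent and now - self.sent[0] >= self.period:
            self.sent.popleft()
        if len(self.sent) >= self.max_calls:
            # Sleep until the oldest recorded call leaves the window.
            self.sleep(self.period - (now - self.sent[0]))
            now = self.clock()
            self.sent.popleft()
        self.sent.append(now)
```

Create one `MinuteThrottle()` per model and call its `wait()` immediately before each `client.chat.completions.create(...)`. This only caps requests per minute; token limits would need separate accounting.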
Current model names
Groq deprecates old model versions frequently. Always verify at console.groq.com/docs/models. As of June 2026:
| Model ID | Notes |
|---|---|
| llama-3.3-70b-versatile | Best quality, recommended |
| llama-3.1-8b-instant | Fastest, lowest cost, 128K context |
| llama3-8b-8192 | Older 8K context version (use llama-3.1-8b-instant instead) |
| mixtral-8x7b-32768 | 32K context, good quality |
| gemma2-9b-it | Google Gemma, fast |
| llama-3.2-11b-vision-preview | Vision/image input support |
| whisper-large-v3 | Speech-to-text (audio API, not chat) |
Deprecated (will return 404): llama2-70b-4096, llama3-70b-8192 — use llama-3.3-70b-versatile instead.
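If model IDs are stored in config files or user settings, a small lookup can swap deprecated IDs for their replacements before a request is sent. A minimal sketch: `DEPRECATED_MODELS` and `resolve_model` are hypothetical names, and the mapping contains only the two deprecations named above.

```python
# Replacements taken from the deprecation note above; extend as models change.
DEPRECATED_MODELS = {
    "llama2-70b-4096": "llama-3.3-70b-versatile",
    "llama3-70b-8192": "llama-3.3-70b-versatile",
}


def resolve_model(model_id):
    """Return (model_to_use, was_deprecated) for a requested model ID."""
    replacement = DEPRECATED_MODELS.get(model_id)
    if replacement is not None:
        return replacement, True
    return model_id, False
```

Logging a warning whenever `was_deprecated` is true makes it easier to find and update the stale configuration.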
Context length exceeded
Each Groq model has different context limits. If you hit the limit, switch to a model with a larger context window:
| Model | Context |
|---|---|
| llama-3.3-70b-versatile | 128K tokens |
| llama-3.1-8b-instant | 128K tokens |
| mixtral-8x7b-32768 | 32K tokens |
| llama3-8b-8192 | 8K tokens (old model) |
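A rough pre-flight check can tell whether a prompt is likely to fit before the request is sent. The sketch below makes two assumptions that are not from Groq's documentation: "K" in the table is taken as 1,024 tokens, and token counts are estimated at about 4 characters per token, which is only a coarse heuristic for English text with plain string message content. The function names are hypothetical.

```python
# Context windows from the table above, in tokens (assuming K = 1,024).
CONTEXT_WINDOWS = {
    "llama-3.3-70b-versatile": 128 * 1024,
    "llama-3.1-8b-instant": 128 * 1024,
    "mixtral-8x7b-32768": 32 * 1024,
    "llama3-8b-8192": 8 * 1024,
}


def estimate_tokens(messages):
    """Very rough estimate: about 4 characters per token (an assumption)."""
    chars = sum(len(m["content"]) for m in messages)
    return chars // 4 + 1


def fits_context(model_id, messages, reserve_for_reply=1024):
    """True if the estimated prompt plus a reply budget fits the model's window."""
    return estimate_tokens(messages) + reserve_for_reply <= CONTEXT_WINDOWS[model_id]
```

Because the estimate is crude, it is safer to treat a result near the limit as "does not fit" and either trim the history or pick a larger-context model.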
For conversation history management, trim older messages to stay within limits:
def trim_messages(messages, max_messages=40):
    # Keep all system messages plus the most recent conversation messages.
    system = [m for m in messages if m["role"] == "system"]
    conversation = [m for m in messages if m["role"] != "system"]
    # Simple trim by message count, not tokens: 40 messages is roughly the last 20 exchanges.
    return system + conversation[-max_messages:]

Streaming not working
Groq uses the OpenAI-compatible API for streaming. Iterate chunks — do NOT call .content on the streaming response object directly:
# Python streaming
response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Write a short story"}],
    stream=True,
)
for chunk in response:
    content = chunk.choices[0].delta.content
    if content:
        print(content, end="", flush=True)

// JavaScript streaming
const stream = await client.chat.completions.create({
  model: "llama-3.3-70b-versatile",
  messages: [{ role: "user", content: "Write a story" }],
  stream: true,
});
for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content || "");
}

- Set stream=True (Python) or stream: true (REST/JS) in the request.
- The REST response is Server-Sent Events: each data: line contains a JSON chunk.
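When calling the REST endpoint without an SDK, the `data:` lines have to be parsed by hand. The sketch below shows one way to do that for already-decoded text lines. `iter_sse_content` is a hypothetical helper, and it assumes the stream ends with the `data: [DONE]` sentinel used by OpenAI-style streaming APIs.

```python
import json


def iter_sse_content(lines):
    """Yield text deltas from an iterable of decoded SSE lines."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel (assumed, as in OpenAI-style streams)
        chunk = json.loads(payload)
        choices = chunk.get("choices") or []
        if not choices:
            continue
        # The first chunk often carries only the role, so content may be absent.
        content = choices[0].get("delta", {}).get("content")
        if content:
            yield content
```

With the `requests` library, for example, the lines could come from `response.iter_lines(decode_unicode=True)` on a request made with `stream=True`; joining the yielded pieces reproduces the full reply text.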
FAQ
Why is Groq so fast?
Groq runs on custom Language Processing Units (LPUs) rather than GPUs. LPUs execute deterministically and are designed to avoid the memory-bandwidth bottleneck that limits GPU inference, which enables roughly 500–750 tokens/second on 70B models, about 5–10x faster than typical GPU inference.
Groq vs OpenRouter vs Together AI — which for fastest inference?
Groq is fastest (LPU hardware), ideal for latency-sensitive use cases. OpenRouter gives access to more models. Together AI offers fine-tuning and dedicated deployments. For raw speed at common model sizes, Groq wins.
Does Groq support function calling / tool use?
Yes, via the OpenAI-compatible tools parameter. Supported on llama-3.3-70b-versatile, llama-3.1-8b-instant, and mixtral-8x7b-32768. Same format as OpenAI tool definitions.
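As an illustration of that format, the sketch below defines one tool and a dispatcher for the calls a model may return. The `get_weather` tool, its schema, and the `run_tool_call` helper are hypothetical and chosen only for the example; the structure of the `tools` list follows the OpenAI tool-definition format.

```python
import json

# Hypothetical tool for illustration; the name and schema are not from Groq's docs.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"},
                },
                "required": ["city"],
            },
        },
    }
]


def get_weather(city):
    """Stand-in implementation that returns canned data."""
    return {"city": city, "forecast": "sunny"}


def run_tool_call(name, arguments_json):
    """Dispatch one tool call; arguments arrive as a JSON string in the OpenAI format."""
    handlers = {"get_weather": get_weather}
    args = json.loads(arguments_json)
    return json.dumps(handlers[name](**args))
```

Pass `tools=tools` to `client.chat.completions.create(...)`. If the response's `message.tool_calls` is non-empty, run each call with `run_tool_call(call.function.name, call.function.arguments)` and send the result back as a message with role `tool` and the matching `tool_call_id`.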