Groq Guide 2025: The Fastest AI Inference API
Groq uses custom LPU (Language Processing Unit) chips, not GPUs, to deliver 200–800 tokens/sec on Llama 3.3 70B. That is roughly 5–10× faster than OpenAI models of comparable quality. This guide covers the free tier, the OpenAI-compatible API, and real-time use cases.
1. What is Groq?
Groq is an AI inference company that built custom silicon called the LPU (Language Processing Unit) specifically for running LLMs. Unlike GPUs, which are general-purpose compute chips repurposed for AI, LPUs are designed from scratch to stream tokens as fast as possible.
LPU vs GPU — the key difference
When speed matters
2. Groq free tier — rate limits
Groq's free tier is generous for development and experimentation. You do not need a credit card to start. Rate limits apply per model.
| Model | Requests/min | Tokens/min | Tokens/day |
|---|---|---|---|
| llama-3.3-70b-versatile | 30 | 6,000 | 131,072 |
| llama-3.1-8b-instant | 30 | 30,000 | 1,000,000 |
| mixtral-8x7b-32768 | 30 | 5,000 | 500,000 |
| gemma2-9b-it | 30 | 15,000 | 500,000 |
Tip: if you hit the 70B token/min limit, switch to llama-3.1-8b-instant, which has 5× the tokens/min budget (30,000 vs 6,000) and is usually good enough for shorter tasks.
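One way to apply this tip is a fallback that retries on the smaller model when the larger one is rate limited. The sketch below is an illustration under assumptions, not a pattern Groq prescribes: `complete_with_fallback` and `FALLBACK_ORDER` are hypothetical names, and the helper takes the request function and the rate-limit exception type as arguments so it can be wired to whichever SDK you use.

```python
from typing import Any, Callable, Sequence, Tuple, Type

# Model IDs from the rate-limit table, ordered from preferred to fallback.
FALLBACK_ORDER = ("llama-3.3-70b-versatile", "llama-3.1-8b-instant")


def complete_with_fallback(
    create: Callable[..., Any],
    rate_limit_error: Type[Exception],
    models: Sequence[str] = FALLBACK_ORDER,
    **request: Any,
) -> Tuple[str, Any]:
    """Try each model in order; move to the next when one is rate limited.

    Returns (model_id_used, response). Hypothetical helper for illustration.
    """
    last_error = None
    for model in models:
        try:
            return model, create(model=model, **request)
        except rate_limit_error as exc:
            last_error = exc  # remember why this model was skipped
    raise RuntimeError("every model in the fallback list is rate limited") from last_error
```

With the OpenAI Python SDK pointed at Groq, `create` would be `client.chat.completions.create` and `rate_limit_error` would be `openai.RateLimitError`; the remaining keyword arguments (`messages`, `max_tokens`, and so on) are passed through unchanged.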
3. Available models on Groq
Groq focuses on a curated set of high-quality open models rather than hosting every available model. Use the exact model IDs below in API calls.
llama-3.3-70b-versatile (best quality): Meta's Llama 3.3 70B, the highest-quality model on Groq and competitive with GPT-4o on most tasks. 128k context window. Use this as your default for complex reasoning, coding, and writing.
llama-3.1-8b-instant (fastest, most generous limits): Llama 3.1 8B, with a 5× higher token/min limit than the 70B. Use for classification, extraction, and simple Q&A where you need high throughput on the free tier. Still very capable for most tasks.
mixtral-8x7b-32768 (32k context): Mistral's Mixtral 8x7B MoE (mixture of experts) model. Strong for multilingual tasks and code. Note the 32k context limit if you need longer documents.
gemma2-9b-it (Google): Google's Gemma 2 9B, instruction-tuned. Excellent for structured output generation and tasks where concise, accurate answers matter. 8k context window.
deepseek-r1-distill-llama-70b (reasoning): DeepSeek R1 distilled into a 70B Llama architecture, with strong chain-of-thought reasoning at Groq speed. Use for math, logic, and step-by-step analysis.
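To keep the exact model IDs in one place, a small lookup table can help. This is only a sketch: the task labels and the `pick_model` helper are hypothetical, while the model IDs and their suggested uses come from the list above.

```python
# Hypothetical task labels mapped to the Groq model IDs listed above.
MODEL_FOR_TASK = {
    "general": "llama-3.3-70b-versatile",          # complex reasoning, coding, writing
    "high_throughput": "llama-3.1-8b-instant",     # classification, extraction, simple Q&A
    "multilingual": "mixtral-8x7b-32768",          # multilingual tasks and code
    "structured_output": "gemma2-9b-it",           # concise, structured answers
    "reasoning": "deepseek-r1-distill-llama-70b",  # math, logic, step-by-step analysis
}


def pick_model(task: str, default: str = "llama-3.3-70b-versatile") -> str:
    """Return the model ID for a task label, falling back to the default model."""
    return MODEL_FOR_TASK.get(task, default)


print(pick_model("high_throughput"))
```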
4. API setup (OpenAI-compatible)
Groq's API follows the OpenAI API format. If you already use the OpenAI Python or Node.js SDK, you only need to change two things: the base URL and the API key.
Step 1: Create a free account at console.groq.com. No credit card required. Sign up with email or GitHub.
Step 2: Generate an API key. Go to API Keys in the console and click Create API Key. The key starts with gsk_. Store it in an environment variable and never hardcode it.
Step 3: Install the OpenAI SDK. For Python, run pip install openai.
Step 4: Set base_url to Groq. The only change from OpenAI: base_url="https://api.groq.com/openai/v1" and your Groq key.
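The advice to keep the key in an environment variable can be sketched as a small loader that fails early with a readable message. `load_groq_key` is a hypothetical helper, and the prefix check relies only on the gsk_ prefix mentioned above.

```python
import os


def load_groq_key(env_var: str = "GROQ_API_KEY") -> str:
    """Read the API key from the environment and sanity-check its prefix."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"{env_var} is not set; export it before running")
    if not key.startswith("gsk_"):
        raise RuntimeError(f"{env_var} does not look like a Groq key (expected a gsk_ prefix)")
    return key
```

The returned value can then be passed as `api_key` when constructing the client, alongside the Groq `base_url`.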
5. Python example
This is a complete working example using the OpenAI Python SDK pointed at Groq. It reads your key from the GROQ_API_KEY environment variable, so export that variable with your key from the console before running.

    import os

    from openai import OpenAI

    # Point the OpenAI SDK at Groq's OpenAI-compatible endpoint.
    client = OpenAI(
        base_url="https://api.groq.com/openai/v1",
        api_key=os.environ["GROQ_API_KEY"],  # key from the console, starts with gsk_
    )

    response = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[{"role": "user", "content": "Summarize the key benefits of LPU inference"}],
        max_tokens=512,  # cap on the number of generated tokens
    )

    print(response.choices[0].message.content)

Tip: use GROQ_API_KEY as your environment variable name. The groq Python package (an alternative to openai) reads it automatically.
6. Speed benchmarks: Groq vs OpenAI
The table below compares Llama models on Groq with comparable models on OpenAI and Together AI, reporting throughput (tokens/sec), average time to first token (TTFT), and price.
| Provider / Model | Tokens/sec | TTFT (avg) | Price / 1M tokens |
|---|---|---|---|
| Groq — Llama 3.3 70B | 200–800 | ~200ms | $0.59 in / $0.79 out |
| Groq — Llama 3.1 8B | 800–1200 | ~100ms | $0.05 in / $0.08 out |
| OpenAI — GPT-4o | 60–100 | ~500ms | $2.50 in / $10 out |
| OpenAI — GPT-4o mini | 80–120 | ~350ms | $0.15 in / $0.60 out |
| Together AI — Llama 3.3 70B | 100–200 | ~300ms | $0.88 in / $0.88 out |
Note: tokens/sec figures are representative ranges. Actual speed varies based on prompt length, concurrent load, and model cache state. TTFT = time to first token.
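The price column can be turned into a rough cost estimate for a given workload. The sketch below copies the prices from the table; the dictionary keys, the `estimate_cost` helper, and the example workload of 10M input and 2M output tokens are assumptions made for illustration.

```python
# USD per 1M tokens as (input, output), copied from the table above.
PRICES = {
    "groq/llama-3.3-70b": (0.59, 0.79),
    "groq/llama-3.1-8b": (0.05, 0.08),
    "openai/gpt-4o": (2.50, 10.00),
    "openai/gpt-4o-mini": (0.15, 0.60),
    "together/llama-3.3-70b": (0.88, 0.88),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost in USD for one workload."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


# Assumed example workload: 10M input tokens and 2M output tokens.
for name in PRICES:
    print(f"{name}: ${estimate_cost(name, 10_000_000, 2_000_000):.2f}")
```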
7. When to use Groq
⚡ Real-time AI applications
Chatbots, autocomplete systems, coding assistants, and customer support tools where users watch the AI response stream in live. Lower latency means better perceived quality: Groq's roughly 200ms TTFT versus 300–500ms on cloud GPU providers is noticeable.
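For streaming UIs, it helps to measure time to first token yourself rather than rely on published figures. The sketch below consumes a stream of chat-completion chunks in the shape the OpenAI Python SDK yields when `stream=True` is set; `consume_stream` is a hypothetical helper, and `started_at` is assumed to be a `time.perf_counter()` reading taken just before the request was sent.

```python
import time
from typing import Any, Callable, Iterable, Tuple


def consume_stream(
    chunks: Iterable[Any],
    started_at: float,
    on_text: Callable[[str], None] = lambda text: print(text, end="", flush=True),
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[str, float]:
    """Emit streamed text as it arrives; return (full_text, seconds_to_first_text)."""
    first_text_at = None
    parts = []
    for chunk in chunks:
        if not chunk.choices:
            continue  # skip chunks that carry no choices
        text = chunk.choices[0].delta.content
        if text:
            if first_text_at is None:
                first_text_at = clock()  # first visible token arrived
            parts.append(text)
            on_text(text)
    ttft = first_text_at - started_at if first_text_at is not None else float("nan")
    return "".join(parts), ttft
```

In use, `chunks` would be the object returned by `client.chat.completions.create(..., stream=True)`. Passing a custom `on_text` lets the same loop feed a websocket or UI callback instead of printing.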
🎤 Voice AI pipelines
Voice AI needs the LLM's text reply almost immediately after transcription so that speech synthesis can start without an audible pause. A voice pipeline that uses Deepgram for STT, Groq for LLM inference, and ElevenLabs Turbo for TTS can achieve sub-500ms full round-trip latency, close to the pace of human conversation.
📦 Batch processing
Classifying 100,000 customer support tickets, extracting structured data from documents, or generating summaries for a content pipeline. At 800 tok/sec, a 500-token classification task takes under 1 second (about 0.6 seconds). Run sequentially, 100k tasks finish in about 17 hours, versus about 139 hours on a 100 tok/sec provider.
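The batch estimate can be checked with a few lines of arithmetic. This sketch assumes 500 generated tokens per task, tasks run one after another, and a constant decode speed; it ignores TTFT, rate limits, and parallel requests. `batch_hours` is a hypothetical helper.

```python
def batch_hours(tasks: int, tokens_per_task: int, tokens_per_sec: float) -> float:
    """Hours to generate all outputs sequentially at a fixed decode speed."""
    seconds = tasks * tokens_per_task / tokens_per_sec
    return seconds / 3600


fast = batch_hours(100_000, 500, 800)  # about 17.4 hours
slow = batch_hours(100_000, 500, 100)  # about 138.9 hours
print(f"800 tok/sec: {fast:.1f} h, 100 tok/sec: {slow:.1f} h, ratio: {slow / fast:.0f}x")
```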
🤖 Agentic loops
AI agents call the LLM repeatedly, often 10–50+ times per task. A call that takes 30+ seconds at 100 tok/sec finishes in under 5 seconds at 800 tok/sec on Groq. That speedup, up to 8× on generation time, compounds across every agent iteration.
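To see how per-call latency adds up over an agent run, the sketch below totals the LLM wait time across sequential steps. The workload is an assumption chosen to match the 30-second figure: 20 steps with 3,000 generated tokens per call. `agent_seconds` is a hypothetical helper, and tool execution time is not modeled.

```python
def agent_seconds(
    steps: int, tokens_per_call: int, tokens_per_sec: float, ttft_sec: float = 0.0
) -> float:
    """Total LLM wait time for an agent that makes `steps` sequential calls."""
    per_call = ttft_sec + tokens_per_call / tokens_per_sec
    return steps * per_call


# Assumed workload: 20 steps, 3,000 generated tokens per call.
slow_run = agent_seconds(20, 3_000, 100)  # 600.0 seconds
fast_run = agent_seconds(20, 3_000, 800)  # 75.0 seconds
print(f"100 tok/sec: {slow_run:.0f} s, 800 tok/sec: {fast_run:.0f} s")
```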
🚫 When NOT to use Groq
Groq has a smaller model catalog than Together AI or OpenRouter. If you need OpenAI-specific capabilities (vision, image generation with DALL-E, Code Interpreter), Claude's quality for long documents, or fine-tuning support, use the specialized provider. Groq is best when you want open models at maximum speed.
Monitor Groq API status at Prismix
Before building on Groq, check live Groq API status at Prismix. Get free email alerts when Groq has an outage — know instantly whether your app's slowdown is Groq's issue or yours.
FAQ
Is Groq free to use?
Yes. Groq offers a free tier with per-model rate limits, for example 30 requests/min and 6,000 tokens/min on Llama 3.3 70B. The free tier is generous enough for development, experimentation, and small apps. Paid tiers raise these limits and increase throughput.
How fast is Groq compared to OpenAI?
Groq LPU hardware typically delivers 200–800 tokens/sec for Llama 3.3 70B, compared to 60–100 tokens/sec on OpenAI GPT-4o. In practice, Groq is roughly 5–10× faster for inference-heavy workloads like voice AI, real-time chat, and batch processing.
Is Groq compatible with the OpenAI SDK?
Yes. Groq uses an OpenAI-compatible API. You only need to change two lines: set base_url to https://api.groq.com/openai/v1 and replace your OpenAI API key with your Groq API key (starts with gsk_). The rest of your code stays identical.
What models does Groq support?
Groq supports Llama 3.3 70B, Llama 3.1 8B, Mixtral 8x7B, Gemma 2 9B, and DeepSeek R1 Distill variants. The full model list is at console.groq.com/docs/models.
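Because the catalog changes over time, it can be safer to ask the API which model IDs are currently available than to hardcode them. The sketch below assumes `client` is an OpenAI SDK client configured with the Groq base URL, and that Groq's OpenAI-compatible endpoint serves the SDK's model-listing call; `available_model_ids` is a hypothetical helper.

```python
from typing import Any, List


def available_model_ids(client: Any) -> List[str]:
    """Return the sorted model IDs reported by the API's model-listing call."""
    return sorted(model.id for model in client.models.list())
```

A startup check such as `"llama-3.3-70b-versatile" in available_model_ids(client)` would catch a renamed or retired model before the first request fails.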