Groq Guide 2025: The Fastest AI Inference API

Groq uses custom LPU (Language Processing Unit) chips instead of GPUs to deliver 200–800 tokens/sec on a 70B model, roughly 5–10× the throughput typically seen from GPT-4o. This guide covers the free tier, the OpenAI-compatible API, and real-time use cases.

1. What is Groq?

Groq is an AI inference company that built custom silicon called the LPU (Language Processing Unit) specifically for running LLMs. Unlike GPUs, which are general-purpose compute chips repurposed for AI, LPUs are designed from scratch to stream tokens as fast as possible.

LPU vs GPU — the key difference

GPU (OpenAI and most other hosted LLM providers): general-purpose parallel compute, excellent for training, but not optimal for sequential token generation. Typical speed: 50–100 tokens/sec.

LPU (Groq): purpose-built for inference only. Each token is produced as fast as the chip can compute it, without the memory bandwidth bottleneck that limits GPUs. Typical speed: 200–800 tokens/sec for 70B models.

When speed matters

✓ Voice AI: generating speech text in real time requires low latency between user utterance and response

✓ Real-time chat: streaming responses appear faster, reducing perceived wait time

✓ Batch processing: running 1,000 classification tasks finishes in minutes instead of hours

✓ Agentic loops: multi-step agents that call the LLM dozens of times per task benefit enormously

2. Groq free tier — rate limits

Groq's free tier is generous for development and experimentation. You do not need a credit card to start. Rate limits apply per model.

Model	Requests/min	Tokens/min	Tokens/day
llama-3.3-70b-versatile	30	6,000	131,072
llama-3.1-8b-instant	30	30,000	1,000,000
mixtral-8x7b-32768	30	5,000	500,000
gemma2-9b-it	30	15,000	500,000

Tip: if you hit the 70B tokens/min limit, switch to llama-3.1-8b-instant, which has 5× the tokens/min budget and is usually good enough for shorter tasks.
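The fallback described in the tip can be sketched as a small helper. This is a minimal sketch, not part of any SDK: call_with_fallback, call, and is_rate_limited are hypothetical names, and the caller supplies both the function that performs the request and the test for "this error was a rate limit".

```python
from typing import Callable, Sequence, Tuple, TypeVar

T = TypeVar("T")


def call_with_fallback(
    call: Callable[[str], T],
    models: Sequence[str],
    is_rate_limited: Callable[[Exception], bool],
) -> Tuple[str, T]:
    """Try each model in order, moving to the next only on a rate-limit error.

    Returns (model_used, result). Any other error is re-raised immediately.
    """
    if not models:
        raise ValueError("models must contain at least one model ID")
    last_error = None
    for model in models:
        try:
            return model, call(model)
        except Exception as exc:
            if not is_rate_limited(exc):
                raise  # not a rate limit: do not hide real failures
            last_error = exc
    # Every model in the list was rate limited.
    raise last_error
```

With the OpenAI Python SDK, is_rate_limited could be lambda e: isinstance(e, openai.RateLimitError), and models could be ["llama-3.3-70b-versatile", "llama-3.1-8b-instant"].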

3. Available models on Groq

Groq focuses on a curated set of high-quality open models rather than hosting every available model. Use the exact model IDs below in API calls.

llama-3.3-70b-versatile Best quality

Meta's latest Llama 3.3 70B — best quality on Groq, competitive with GPT-4o on most tasks. 128k context window. Use this as your default for complex reasoning, coding, and writing.

llama-3.1-8b-instant Fastest + most generous limits

Llama 3.1 8B — 5× higher token/min limit than the 70B. Use for classification, extraction, and simple Q&A where you need high throughput on the free tier. Still very capable for most tasks.

mixtral-8x7b-32768 32k context

Mistral's Mixtral 8x7B MoE (mixture of experts) model. 32k context window. Strong for multilingual tasks and code. Note the 32k context limit if you need longer documents.

gemma2-9b-it Google

Google's Gemma 2 9B — instruction-tuned. Excellent for structured output generation and tasks where concise, accurate answers matter. 8k context window.

deepseek-r1-distill-llama-70b Reasoning

DeepSeek R1 distilled into a 70B Llama architecture — strong chain-of-thought reasoning at Groq speed. Use for math, logic, and step-by-step analysis.

4. API setup (OpenAI-compatible)

Groq's API follows the OpenAI API format and works with the OpenAI client libraries for standard chat completions. If you already use the OpenAI Python or Node.js SDK, you only need to change two lines.

Create a free account at console.groq.com

No credit card required. Sign up with email or GitHub.

Generate an API key

Go to API Keys in the console and click Create API Key. The key starts with gsk_. Store it in an environment variable — never hardcode it.

Install the OpenAI SDK

pip install openai

Set base_url to Groq

The only change from OpenAI: base_url="https://api.groq.com/openai/v1" and your Groq key.

5. Python example

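This is a complete working example using the OpenAI Python SDK pointed at Groq. It reads the key from the GROQ_API_KEY environment variable, so set that to your key from the console before running.

import os

from openai import OpenAI

# Point the OpenAI SDK at Groq's OpenAI-compatible endpoint.
client = OpenAI(
    base_url="https://api.groq.com/openai/v1",
    api_key=os.environ["GROQ_API_KEY"],  # read from the environment, never hardcoded
)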

response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Summarize the key benefits of LPU inference"}],
    max_tokens=512
)
print(response.choices[0].message.content)

Pro tip: Use GROQ_API_KEY as your environment variable name — the groq Python package (alternative to openai) reads it automatically.
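For real-time chat, the same call can stream tokens as they are generated. The following is a minimal sketch, assuming the OpenAI SDK's stream=True option behaves against Groq's endpoint the way it does against OpenAI's. make_groq_client and stream_reply are illustrative helper names, not part of either SDK.

```python
import os
from typing import Iterator


def make_groq_client():
    """Build an OpenAI SDK client pointed at Groq (needs `pip install openai`)."""
    from openai import OpenAI  # imported lazily so stream_reply has no hard dependency

    return OpenAI(
        base_url="https://api.groq.com/openai/v1",
        api_key=os.environ["GROQ_API_KEY"],
    )


def stream_reply(client, prompt: str, model: str = "llama-3.3-70b-versatile") -> Iterator[str]:
    """Yield text fragments as they arrive instead of waiting for the full reply."""
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        # Some chunks carry no choices or no text (for example the final one).
        if chunk.choices and chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content
```

Usage: for piece in stream_reply(make_groq_client(), "Explain LPUs in one paragraph"): print(piece, end="", flush=True)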

6. Speed benchmarks: Groq vs OpenAI

These are representative throughput numbers for Llama models on Groq versus comparable models on other providers, reported as tokens/sec and time to first token (TTFT).

Provider / Model	Tokens/sec	TTFT (avg)	Price / 1M tokens
Groq — Llama 3.3 70B	200–800	~200ms	$0.59 in / $0.79 out
Groq — Llama 3.1 8B	800–1200	~100ms	$0.05 in / $0.08 out
OpenAI — GPT-4o	60–100	~500ms	$2.50 in / $10 out
OpenAI — GPT-4o mini	80–120	~350ms	$0.15 in / $0.60 out
Together AI — Llama 3.3 70B	100–200	~300ms	$0.88 in / $0.88 out

Note: tokens/sec figures are representative ranges. Actual speed varies based on prompt length, concurrent load, and model cache state. TTFT = time to first token.
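To turn the price column into a per-request figure, multiply input and output tokens by their respective rates. The sketch below copies the prices from the table above. The dictionary keys are made-up labels for this example, not API model IDs, and the prices may have changed since the table was compiled.

```python
# (input, output) price in USD per 1M tokens, copied from the table above.
PRICES_PER_MILLION = {
    "groq/llama-3.3-70b": (0.59, 0.79),
    "groq/llama-3.1-8b": (0.05, 0.08),
    "openai/gpt-4o": (2.50, 10.00),
    "openai/gpt-4o-mini": (0.15, 0.60),
    "together/llama-3.3-70b": (0.88, 0.88),
}


def request_cost(label: str, input_tokens: int, output_tokens: int) -> float:
    """Return the cost in USD of one request with the given token counts."""
    price_in, price_out = PRICES_PER_MILLION[label]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


# Compare one request (2,000 prompt tokens, 500 completion tokens) across providers.
for label in PRICES_PER_MILLION:
    print(f"{label}: ${request_cost(label, 2_000, 500):.6f}")
```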

7. When to use Groq

⚡ Real-time AI applications

Chatbots, autocomplete systems, coding assistants, and customer support tools where users see the AI response stream live. Lower latency = better perceived quality. Groq's 200ms TTFT versus 500ms+ on cloud GPU providers is noticeable.

🎤 Voice AI pipelines

Voice AI requires generating the text to be spoken almost immediately after transcription. A voice pipeline that uses Groq for LLM inference, Deepgram for STT, and ElevenLabs Turbo for TTS can achieve sub-500ms full round-trip latency, which is close to the pace of natural conversation.

📦 Batch processing

Classifying 100,000 customer support tickets, extracting structured data from documents, or generating summaries for a content pipeline. At 800 tok/sec, a 500-token classification task takes under 1 second (about 0.6 s), so 100k tasks run back to back take about 17 hours versus about 139 hours on a 100 tok/sec provider.
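A quick way to check a batch estimate like this is to divide total generated tokens by throughput. This sketch assumes tasks run one after another and that generation time dominates. It ignores rate limits, network latency, and time to first token, and generation_hours is a name chosen for this example.

```python
def generation_hours(tasks: int, tokens_per_task: int, tokens_per_sec: float) -> float:
    """Hours of pure token generation for a sequential batch."""
    total_tokens = tasks * tokens_per_task
    return total_tokens / tokens_per_sec / 3600


for label, speed in [("800 tok/sec", 800), ("100 tok/sec", 100)]:
    print(f"{label}: {generation_hours(100_000, 500, speed):.1f} hours")
# 800 tok/sec: 17.4 hours
# 100 tok/sec: 138.9 hours
```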

🤖 Agentic loops

AI agents call the LLM repeatedly, often 10–50+ times per task. At 100 tok/sec, a call that generates about 3,000 tokens takes 30 seconds. At 800 tok/sec on Groq, the same call takes under 5 seconds. That 8× speedup applies to every step, so a 20-step agent drops from about 10 minutes of generation to a little over 1 minute.
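Because agent steps run in sequence, per-call latency adds up linearly. The sketch below models each call as time to first token plus generation time. The 3,000 tokens per call is an assumption chosen to match a 30-second call at 100 tok/sec, and agent_seconds is a hypothetical helper name.

```python
def agent_seconds(
    steps: int,
    tokens_per_call: int,
    tokens_per_sec: float,
    ttft_sec: float = 0.0,
) -> float:
    """Total LLM wait time for an agent whose steps run one after another."""
    per_call = ttft_sec + tokens_per_call / tokens_per_sec
    return steps * per_call


slow = agent_seconds(steps=20, tokens_per_call=3_000, tokens_per_sec=100)
fast = agent_seconds(steps=20, tokens_per_call=3_000, tokens_per_sec=800)
print(f"100 tok/sec: {slow:.0f} s, 800 tok/sec: {fast:.0f} s, speedup {slow / fast:.0f}x")
# 100 tok/sec: 600 s, 800 tok/sec: 75 s, speedup 8x
```

Passing ttft_sec (for example 0.5 versus 0.2, the averages from the benchmark table) adds the fixed per-call delay on top of generation time.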

🚫 When NOT to use Groq

Groq has a smaller model catalog than Together AI or OpenRouter. If you need GPT-4o's specific capabilities (vision, image generation with DALL-E, Code Interpreter), Claude's quality on long documents, or fine-tuning support, use the provider that offers them. Groq is best when you want open models at maximum speed.

FAQ

Is Groq free to use?

Yes. Groq offers a free tier with per-model rate limits, for example 30 requests/min and 6,000 tokens/min on Llama 3.3 70B. The free tier is generous enough for development, experimentation, and small apps. Paid tiers raise the rate limits and increase throughput.

How fast is Groq compared to OpenAI?

Groq LPU hardware typically delivers 200–800 tokens/sec for Llama 3.3 70B, compared to 50–100 tokens/sec on OpenAI GPT-4o. In practice, Groq is 5–10× faster for inference-heavy workloads like voice AI, real-time chat, and batch processing.

Is Groq compatible with the OpenAI SDK?

Yes. Groq uses an OpenAI-compatible API. You only need to change two lines: set base_url to https://api.groq.com/openai/v1 and replace your OpenAI API key with your Groq API key (starts with gsk_). The rest of your code stays identical.

What models does Groq support?

Groq supports Llama 3.3 70B, Llama 3.1 8B, Mixtral 8x7B, Gemma 2 9B, and DeepSeek R1 Distill variants. The full model list is at console.groq.com/docs/models.
