Meta Llama Guide 2025: Download, Run Locally & Use the API
Meta's Llama models are among the most capable open-weight LLMs available. Run them locally for free with Ollama or LM Studio, or use them through Groq's free API tier. This guide covers a full model comparison, Python code examples, and an introduction to fine-tuning.
1. What is Meta Llama?
Meta Llama is a family of open-weight large language models released by Meta AI. “Open-weight” means the model weights are publicly available: you can download, run, and fine-tune them without paying per-token API fees, and when you self-host, your data never leaves your infrastructure.
Key facts about Llama
2. Llama 3.3 70B vs 3.1 8B vs 3.2 1B — which to choose
| Model | Quality | RAM (local) | Best for |
|---|---|---|---|
| Llama 3.3 70B | GPT-4o level | 40GB+ | Complex reasoning, coding, writing |
| Llama 3.1 8B | GPT-3.5 level | 6GB VRAM / 8GB RAM | Fast API use, classification, extraction |
| Llama 3.2 3B | Good for size | 3GB VRAM / 4GB RAM | Consumer hardware, offline use |
| Llama 3.2 1B | Basic | 1GB VRAM / 2GB RAM | Edge devices, very constrained hardware |
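As a rough illustration of how to read the table, the sketch below picks the largest model whose listed system-RAM figure fits the memory you have. The thresholds are copied from the RAM column above; the helper name `pick_model` is hypothetical, and real requirements vary with quantization and context length.

```python
from typing import Optional

# Minimum system RAM in GB, taken from the "RAM (local)" column of the table.
# Ordered from largest to smallest so the first match is the best fit.
MODELS_BY_MIN_RAM_GB = [
    (40, "Llama 3.3 70B"),
    (8, "Llama 3.1 8B"),
    (4, "Llama 3.2 3B"),
    (2, "Llama 3.2 1B"),
]


def pick_model(available_ram_gb: float) -> Optional[str]:
    """Return the largest model whose listed RAM figure fits, or None."""
    for min_ram_gb, name in MODELS_BY_MIN_RAM_GB:
        if available_ram_gb >= min_ram_gb:
            return name
    return None


print(pick_model(16))  # Llama 3.1 8B
```

Treat the result as a starting point only: a machine that barely meets the listed figure may still swap or run slowly.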
3. Run Llama locally with Ollama
Ollama is the easiest way to run Llama locally. It handles model downloading and serving, and provides an OpenAI-compatible API at localhost:11434.
Install Ollama
Download from ollama.com. Available for Mac (M1–M4 and Intel), Windows, and Linux. See the full Ollama guide for setup details.
Run Llama
The command ollama run llama3.3 downloads and runs Llama 3.3 70B. It requires 40GB+ RAM. For smaller hardware, ollama run llama3.2:3b runs Llama 3.2 3B, which fits in 4GB RAM and works on most machines.
Use the local API
Ollama exposes an OpenAI-compatible API at http://localhost:11434/v1. Point any OpenAI SDK at it with api_key="ollama".
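To make the shape of that API concrete, here is a minimal sketch that builds a chat-completions request with only the Python standard library. It assumes Ollama is running on the default port and that a model tagged `llama3.2:3b` has already been pulled; the tag and the helper names `build_chat_request` and `send` are illustrative assumptions, not part of Ollama.

```python
import json
import urllib.request

OLLAMA_BASE_URL = "http://localhost:11434/v1"  # base URL given in the text above


def build_chat_request(prompt, model="llama3.2:3b", base_url=OLLAMA_BASE_URL):
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,  # assumed tag; use whatever model you have pulled
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Placeholder key, mirroring api_key="ollama" from the text.
            "Authorization": "Bearer ollama",
        },
        method="POST",
    )


def send(request, timeout=120):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Usage, with Ollama running locally:
#   print(send(build_chat_request("Summarize what LoRA is in one sentence.")))
```

The OpenAI SDK sends the same request for you once its base_url points at the local server; the standard-library version is shown only to expose what goes over the wire.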
4. Run Llama locally with LM Studio (GUI)
LM Studio is a desktop app with a ChatGPT-like interface for running local models. Better for non-developers — you browse and download models through a UI and chat with them without any terminal.
LM Studio can also serve models through a local OpenAI-compatible server at localhost:1234/v1. See the full LM Studio guide for detailed setup.
5. Use Llama via API — Groq (free) or Together AI
No local GPU? Use Llama via cloud inference providers. Both are OpenAI SDK-compatible — just change the base URL and API key.
Option A: Groq — free, fastest (200–800 tok/sec)
Free tier: 30 req/min, 6k tokens/min for Llama 3.3 70B. No credit card. Get your key at console.groq.com.
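Staying under those limits is easier with a small client-side throttle. The sketch below is one possible approach, not something Groq provides: a sliding 60-second window that tracks both request count and token usage. The class name `MinuteWindowLimiter` is hypothetical, the token figure per request is whatever estimate you pass in, and the clock is supplied by the caller so the logic is easy to test.

```python
from collections import deque


class MinuteWindowLimiter:
    """Sliding-window limiter for requests per minute and tokens per minute."""

    def __init__(self, max_requests=30, max_tokens=6000, window_s=60.0):
        self.max_requests = max_requests
        self.max_tokens = max_tokens
        self.window_s = window_s
        self._events = deque()  # (timestamp, tokens), oldest first

    def _prune(self, now):
        # Drop events that have already left the window.
        while self._events and now - self._events[0][0] >= self.window_s:
            self._events.popleft()

    def wait_time(self, tokens, now):
        """Seconds to wait before a request of `tokens` fits in the window."""
        if tokens > self.max_tokens:
            raise ValueError("request exceeds the per-minute token budget")
        self._prune(now)
        count = len(self._events)
        used = sum(t for _, t in self._events)
        wait = 0.0
        for ts, t in self._events:
            if count < self.max_requests and used + tokens <= self.max_tokens:
                break
            # Not enough room yet: wait until this event expires.
            wait = ts + self.window_s - now
            count -= 1
            used -= t
        return max(wait, 0.0)

    def record(self, tokens, now):
        """Register a request that was just sent."""
        self._events.append((now, tokens))
```

In practice you would call `time.sleep(limiter.wait_time(estimate, time.monotonic()))` before each request and `record` after it, and still handle HTTP 429 responses, since the provider's own accounting is authoritative.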
from openai import OpenAI
# Llama 3.3 70B via Groq (fastest free option)
client = OpenAI(
    base_url="https://api.groq.com/openai/v1",
    api_key="gsk_YOUR_GROQ_KEY"
)
response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Explain the difference between RAG and fine-tuning"}],
    max_tokens=1024
)
print(response.choices[0].message.content)
Option B: Together AI (pay-per-token, 100+ models, $0.88/1M tokens)
Together AI hosts 100+ open models. Llama 3.3 70B Instruct Turbo costs $0.88/1M tokens in and out. Note: model IDs are case-sensitive.
from openai import OpenAI
# Llama 3.3 70B via Together AI (pay-as-you-go, $0.88/1M tokens)
client = OpenAI(
    base_url="https://api.together.xyz/v1",
    api_key="YOUR_TOGETHER_KEY"
)
response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct-Turbo",
    messages=[{"role": "user", "content": "Write a Python function to parse JSON with error handling"}],
    max_tokens=1024
)
print(response.choices[0].message.content)
Meta's own API (Llama API)
Meta launched its own inference API at llama.developer.meta.com. Free tier available. Also OpenAI-compatible. Less battle-tested than Groq or Together AI as of mid-2025, but directly from the model creator.
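Because the providers in this guide differ only in base URL, API key, and model ID, switching between them can be reduced to a lookup table. The sketch below is one way to organize that. The base URLs and the Groq and Together AI model IDs come from this guide; the environment variable names, the Ollama model tag `llama3.3`, and the helper `client_kwargs` are assumptions made for illustration.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Provider:
    base_url: str
    model: str
    key_env: Optional[str]  # env var holding the API key; None for local servers


PROVIDERS = {
    "groq": Provider(
        "https://api.groq.com/openai/v1",
        "llama-3.3-70b-versatile",
        "GROQ_API_KEY",  # assumed variable name
    ),
    "together": Provider(
        "https://api.together.xyz/v1",
        "meta-llama/Llama-3.3-70B-Instruct-Turbo",  # case-sensitive
        "TOGETHER_API_KEY",  # assumed variable name
    ),
    "ollama": Provider(
        "http://localhost:11434/v1",
        "llama3.3",  # assumed local tag
        None,
    ),
}


def client_kwargs(name, env=os.environ):
    """Return base_url and api_key for the named provider."""
    provider = PROVIDERS[name]
    if provider.key_env is None:
        api_key = "ollama"  # placeholder key for the local server
    else:
        api_key = env.get(provider.key_env)
        if not api_key:
            raise RuntimeError(f"set {provider.key_env} before using {name}")
    return {"base_url": provider.base_url, "api_key": api_key}
```

With the OpenAI SDK installed, `OpenAI(**client_kwargs("groq"))` followed by a call using `PROVIDERS["groq"].model` would reproduce the Groq example above, and changing the name switches providers without touching the rest of the code.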
6. Fine-tuning basics
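Fine-tuning Llama adapts the model to your specific domain, writing style, or task. With open weights you control the whole process, whereas closed models such as GPT-4o offer at most a limited, paid fine-tuning service. The most common technique is LoRA (Low-Rank Adaptation), which freezes the original weights and trains a small set of added low-rank matrices, a tiny fraction of the model's parameter count.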
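A quick calculation shows why LoRA is so much cheaper than full fine-tuning. For a weight matrix of shape d_out × d_in, LoRA trains two low-rank factors of shapes d_out × r and r × d_in, so it adds r × (d_out + d_in) trainable parameters. The dimensions and rank below are illustrative choices, not values taken from this guide.

```python
def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the two low-rank factors B (d_out x r) and A (r x d_in)."""
    return rank * (d_out + d_in)


def trainable_fraction(d_out: int, d_in: int, rank: int) -> float:
    """LoRA parameters as a share of the full d_out x d_in matrix."""
    return lora_trainable_params(d_out, d_in, rank) / (d_out * d_in)


# Illustrative: one 4096 x 4096 projection matrix adapted with rank 16.
print(lora_trainable_params(4096, 4096, 16))        # 131072
print(f"{trainable_fraction(4096, 4096, 16):.2%}")  # 0.78%
```

The overall fraction for a whole model depends on which layers are adapted and on the chosen rank, so the per-matrix figure is only an indication.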
Unsloth — recommended for beginners
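Unsloth makes fine-tuning 2–5× faster with 80% less VRAM, so smaller Llama models can be fine-tuned in a free Google Colab T4 GPU notebook. See unsloth.ai for ready-to-run Llama fine-tuning notebooks that accept your JSONL dataset. Note that Llama 3.3 70B needs far more GPU memory than the free T4's 16GB.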
Axolotl — for production fine-tuning
Axolotl is a production-ready fine-tuning framework supporting LoRA, QLoRA, and full fine-tuning. YAML config-based. Works on multi-GPU setups. See github.com/axolotl-ai-cloud/axolotl.
Data format
Fine-tuning requires conversation data in JSONL format: each line is a JSON object with a messages array (system/user/assistant turns). A typical instruction-tuning dataset has 100–10,000 examples, but as few as 50 high-quality examples can already meaningfully shift model behavior.
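The sketch below writes a dataset in that shape and runs a basic sanity check over it. It assumes each message is an object with `role` and `content` keys, matching the chat format used in the API examples in this guide; the helper names are hypothetical, and a given fine-tuning tool may expect additional or different fields.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}


def to_jsonl(examples):
    """Serialize a list of conversation dicts, one JSON object per line."""
    return "\n".join(json.dumps(ex, ensure_ascii=False) for ex in examples) + "\n"


def validate_jsonl(text):
    """Return a list of human-readable problems; empty means the data looks OK."""
    errors = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append(f"line {lineno}: invalid JSON ({exc.msg})")
            continue
        messages = record.get("messages") if isinstance(record, dict) else None
        if not isinstance(messages, list) or not messages:
            errors.append(f"line {lineno}: missing or empty 'messages' array")
            continue
        for msg in messages:
            ok = (
                isinstance(msg, dict)
                and msg.get("role") in ALLOWED_ROLES
                and isinstance(msg.get("content"), str)
            )
            if not ok:
                errors.append(f"line {lineno}: bad message {msg!r}")
                break
    return errors


example = {
    "messages": [
        {"role": "system", "content": "You are a concise support agent."},
        {"role": "user", "content": "How do I reset my password?"},
        {"role": "assistant", "content": "Open Settings, then choose Reset password."},
    ]
}
print(validate_jsonl(to_jsonl([example])))  # []
```

Running a check like this before training catches malformed lines early, which matters more than usual when the dataset is only a few dozen examples.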
7. Llama vs GPT-4o vs Claude — open vs closed
| Criteria | Llama 3.3 70B | GPT-4o | Claude 3.5 Sonnet |
|---|---|---|---|
| Cost | Free (Groq free tier) | $2.50–$10/1M tokens | $3–$15/1M tokens |
| Self-hostable | Yes — Ollama, Docker | No | No |
| Fine-tuning | Yes — LoRA, QLoRA | Yes (paid, limited) | No |
| Context window | 128k tokens | 128k tokens | 200k tokens |
| Image input | No (Llama 3.2 11B/90B only) | Yes (plus image generation via DALL-E 3) | Yes |
| Speed (API) | 200–800 tok/s (Groq) | 60–100 tok/s | 50–90 tok/s |
| Data privacy | Full (self-hosted) | OpenAI's servers | Anthropic's servers |
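To turn the Cost row into a concrete figure, the sketch below estimates a bill for a given token volume. It assumes the price ranges in the table are input and output prices per 1M tokens respectively, and it uses the $0.88/1M Together AI rate quoted earlier for paid Llama usage. The workload of 2M input and 0.5M output tokens is made up for illustration.

```python
# (input price, output price) in USD per 1M tokens.
# Assumption: the ranges in the table are input-output price pairs.
PRICES_PER_1M = {
    "Llama 3.3 70B (Together AI)": (0.88, 0.88),
    "GPT-4o": (2.50, 10.00),
    "Claude 3.5 Sonnet": (3.00, 15.00),
}


def cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated cost in USD for the given token counts."""
    price_in, price_out = PRICES_PER_1M[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


# Illustrative workload: 2M input tokens and 0.5M output tokens.
for name in PRICES_PER_1M:
    print(f"{name}: ${cost_usd(name, 2_000_000, 500_000):.2f}")
# Llama 3.3 70B (Together AI): $2.20
# GPT-4o: $10.00
# Claude 3.5 Sonnet: $13.50
```

Prices change often, so check each provider's pricing page before relying on these numbers.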
Decision guide
8. Related: full local LLM setup guides
Ollama Guide — CLI-first local LLM runner
Install on Mac/Windows/Linux, REST API at localhost:11434, Open WebUI, Docker, VS Code integration
LM Studio Guide — GUI-first local LLM app
Chat UI, model browser, local server at localhost:1234, GPU acceleration setup
Groq Guide — fastest Llama API (free tier)
LPU hardware, 200-800 tok/sec, free 6k tokens/min, OpenAI-compatible
Monitor Groq and Together AI status at Prismix
If your Llama-powered app stops working, check live API status at Prismix. Free email alerts so you know instantly if it's the provider's issue or yours.
FAQ
Is Meta Llama free to use?
Yes. Meta releases Llama models under the Meta Llama Community License, which permits free commercial use for products with fewer than 700 million monthly active users. Run locally with Ollama for free, or use Groq's free API tier (6k tokens/min, no credit card). At scale, providers like Together AI charge $0.88/1M tokens.
Which Llama model should I use?
For best quality: Llama 3.3 70B (GPT-4o competitive, 128k context). For fast/cheap API use: Llama 3.1 8B (very capable, $0.05/1M on Groq). For local use on consumer hardware: Llama 3.2 3B (runs on 4GB RAM).
Can I run Llama 3.3 70B locally?
Yes, but it requires 40GB+ of RAM or VRAM. On a Mac, that means an Apple Silicon machine (M1–M4) configured with at least 64GB of unified memory. Alternatively, use Groq's free API tier to run the 70B model without any local hardware.
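The 40GB+ figure follows from simple arithmetic on the weights. The sketch below assumes roughly 70 billion parameters and 4-bit quantization, which is an assumption about how the model is packaged for local use. It counts weights only, so the context cache and runtime overhead come on top.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


print(weights_gb(70, 4))   # 35.0  (4-bit quantized)
print(weights_gb(70, 16))  # 140.0 (16-bit, unquantized)
```

About 35GB for the quantized weights, plus working memory for the context, is why 40GB+ is the practical floor and why a 64GB machine is the comfortable choice.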
How does Llama compare to GPT-4o and Claude?
Llama 3.3 70B is competitive with GPT-4o on most benchmarks. The key advantage of Llama: open weights (self-host, fine-tune, data privacy). GPT-4o wins for multimodal tools (DALL-E 3, Code Interpreter). Claude wins for long documents (200k context) and coding benchmarks.