Ollama Guide: Run Local LLMs on Your Mac or PC (2025)
Ollama guide 2025 — install on Mac/Windows/Linux (3 steps), run Llama 3.3/Mistral/Phi-3/DeepSeek locally, CLI commands (pull/run/list/rm), REST API at localhost:11434 (OpenAI-compatible), Open WebUI ChatGPT-like interface via Docker, use with Cursor via Continue extension. Free, private, offline, no API costs.
1. What is Ollama?
Ollama is a free, open-source tool that lets you run large language models locally on your Mac, Windows PC, or Linux machine. No internet connection required once models are downloaded — all processing stays on your device. Think of it as “Docker for LLMs” — simple CLI commands to pull and run models.
Models you can run with Ollama
100+ models available at ollama.com/library
Hardware requirements by model size
| Model | VRAM (GPU) | RAM (CPU only) | Notes |
|---|---|---|---|
| Llama 3.2 1B | 1GB VRAM | 4GB RAM | Runs on any modern machine |
| Llama 3.2 3B | 2.5GB VRAM | 8GB RAM | Good quality for the size |
| Phi-3 Mini 4B | 3GB VRAM | 8GB RAM | Microsoft's small model |
| Llama 3.1 8B | 5GB VRAM | 8GB RAM | Good balance quality/speed |
| Mistral 7B | 5GB VRAM | 8GB RAM | Good for coding and reasoning |
| Llama 3.3 70B | 40GB VRAM | 40GB RAM | Frontier-level quality |
| DeepSeek-R1 70B | 40GB VRAM | 40GB RAM | Strong reasoning |
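The figures in the table can be sanity-checked with back-of-envelope arithmetic: parameter count times bits per weight, plus some headroom. The sketch below assumes 4-bit quantized weights (the table does not state the quantization) and a hypothetical 20% overhead factor for the context cache and runtime buffers. Treat the result as a rough guide, not a figure published by Ollama.

```python
def estimate_memory_gb(params_billions: float, bits_per_weight: float,
                       overhead: float = 1.2) -> float:
    """Rough memory needed to load a model, in GB (1 GB = 1e9 bytes).

    overhead is an assumed fudge factor for the context cache and
    runtime buffers; it is an illustration, not an Ollama constant.
    """
    weight_gb = params_billions * bits_per_weight / 8  # billions of bytes
    return weight_gb * overhead


if __name__ == "__main__":
    for name, size_b in [("Llama 3.2 3B", 3), ("Llama 3.1 8B", 8), ("Llama 3.3 70B", 70)]:
        print(f"{name}: ~{estimate_memory_gb(size_b, 4):.1f} GB at 4-bit")
```

Under these assumptions the 8B and 70B estimates land close to the 5GB and 40GB rows above. Raising bits_per_weight to 8 or 16 shows why unquantized models need far more memory.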
Use cases
2. Install Ollama (3 steps)
Installation takes under 5 minutes on any platform. Ollama runs as a background service and starts automatically on login.
Go to ollama.com — click Download
Navigate to ollama.com and click the Download button. Installers are available for macOS (Apple Silicon + Intel), Windows 10/11, and Linux.
Run the installer and follow prompts
On Mac: drag Ollama to Applications and launch it. On Windows: run the .exe installer. Ollama runs as a background service and an icon appears in your menu bar / system tray.
Verify in terminal: ollama --version
Open a terminal (Terminal on Mac, PowerShell or Command Prompt on Windows) and run ollama --version. You should see the version number, confirming Ollama is installed and on your PATH.
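The same check can be scripted, for example at the start of a setup script. This is a minimal sketch using only the Python standard library; the function name ollama_version is made up for illustration.

```python
import shutil
import subprocess
from typing import Optional


def ollama_version(binary: str = "ollama") -> Optional[str]:
    """Return the output of `<binary> --version`, or None if it is not on PATH."""
    path = shutil.which(binary)
    if path is None:
        return None
    result = subprocess.run([path, "--version"],
                            capture_output=True, text=True, timeout=10)
    # Some tools print version info on stderr, so fall back to it.
    return (result.stdout or result.stderr).strip()


if __name__ == "__main__":
    version = ollama_version()
    print(version if version else "ollama not found on PATH")
```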
3. Your first model (4 steps)
Start with Llama 3.2 — it's around 2GB and works well on most machines. The first download takes a few minutes depending on your connection.
Pull a model
Run ollama pull llama3.2. This downloads the default Llama 3.2 model (~2GB). Progress shows in the terminal.
Run it
Run ollama run llama3.2. This starts an interactive chat session in your terminal. The model loads into memory (5–10 seconds on first run).
Type your message and press Enter
The model responds locally — no internet, no API key, no cost. You'll see the response stream in real time.
Exit the session
Type /bye or press Ctrl+D to exit the interactive session.
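Besides the interactive session, ollama run also accepts the prompt as an argument, prints the answer, and exits, which is handy in scripts. The sketch below wraps that in Python. It assumes Ollama is installed and the model is already pulled; the helper names are illustrative.

```python
import subprocess
from typing import List


def one_shot_command(model: str, prompt: str) -> List[str]:
    """Build the argv for a single non-interactive prompt."""
    return ["ollama", "run", model, prompt]


def ask_once(model: str, prompt: str) -> str:
    """Run one prompt through the Ollama CLI and return the printed answer.

    Raises CalledProcessError if the CLI exits with an error,
    for example when the model name is unknown.
    """
    result = subprocess.run(one_shot_command(model, prompt),
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()
```

For example, print(ask_once("llama3.2", "Why is the sky blue?")) prints the model's answer once generation finishes.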
Other models to try
ollama run mistral       # Mistral 7B: good for coding
ollama run phi3          # Microsoft Phi-3 Mini: fast and small
ollama run llama3.3      # Llama 3.3 70B: requires 40GB+ VRAM/RAM
ollama run deepseek-r1   # DeepSeek R1: strong reasoning model
ollama run codegemma     # Google CodeGemma: coding-focused model
4. Useful Ollama CLI commands
These are the commands you'll use most often to manage models and the Ollama server.
ollama list            # see all downloaded models
ollama pull llama3.2   # download a model
ollama rm llama3.2     # delete a model
ollama show llama3.2   # model details (size, architecture, quantization)
ollama ps              # see running models
ollama serve           # start Ollama server manually (usually auto-starts)
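When a script needs the same information as ollama list, the local server's GET /api/tags endpoint returns the downloaded models as JSON. A minimal sketch, assuming the server runs on the default port; the helper names and the output format are illustrative choices.

```python
import json
import urllib.request
from typing import Dict, List

OLLAMA_URL = "http://localhost:11434"  # default address used in this guide


def summarize_models(tags_response: Dict) -> List[str]:
    """Turn a /api/tags response into 'name  size' lines, largest first."""
    models = sorted(tags_response.get("models", []),
                    key=lambda m: m.get("size", 0), reverse=True)
    return [f"{m['name']}  {m.get('size', 0) / 1e9:.1f} GB" for m in models]


def list_local_models(base_url: str = OLLAMA_URL) -> List[str]:
    """Fetch the locally downloaded models from a running Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=10) as resp:
        return summarize_models(json.load(resp))
```

Keeping the parsing separate from the HTTP call makes it easy to test without a running server.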
5. Use Ollama via API
Ollama runs a local REST API at http://localhost:11434. Alongside its native endpoints (such as /api/generate), it exposes OpenAI-compatible endpoints under /v1, so most code written for the OpenAI chat API can be pointed at Ollama by changing two lines: base_url and api_key.
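The native endpoint needs no SDK. The sketch below sends a non-streaming request to /api/generate with only the Python standard library and reads the "response" field of the reply. It assumes the server is on the default port and that llama3.2 has been pulled; the function names are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"


def build_generate_request(model: str, prompt: str,
                           base_url: str = OLLAMA_URL) -> urllib.request.Request:
    """Build a non-streaming POST to the native /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, base_url: str = OLLAMA_URL) -> str:
    """Send the prompt to a running Ollama server and return the answer text."""
    request = build_generate_request(model, prompt, base_url)
    with urllib.request.urlopen(request, timeout=120) as resp:
        return json.load(resp)["response"]
```

Calling generate("llama3.2", "Why is the sky blue?") returns the full answer as one string, because "stream" is set to false.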
Generate endpoint (simple curl)
curl http://localhost:11434/api/generate -d '{"model": "llama3.2", "prompt": "Why is the sky blue?", "stream": false}'
OpenAI-compatible chat endpoint (Python)
Use the OpenAI Python SDK — just change base_url and api_key. The rest of your code stays the same.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required param, value doesn't matter
)

response = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Explain recursion in simple terms"}],
)
print(response.choices[0].message.content)
6. Open WebUI: ChatGPT-like interface for Ollama
Open WebUI is a free, open-source web interface for Ollama that looks and feels like ChatGPT. If you want a GUI instead of terminal chat, this is the best option.
Install with Docker (one command)
docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui --restart always \
  ghcr.io/open-webui/open-webui:main
Requires Docker Desktop. After the container starts, open http://localhost:3000 in your browser.
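If the page does not load, it helps to know whether the container or the Ollama server is the part that is not answering. The following is a small sketch that probes both local addresses from this guide; the helper name is_reachable is made up for illustration.

```python
import urllib.error
import urllib.request


def is_reachable(url: str, timeout: float = 3.0) -> bool:
    """Return True if an HTTP server answers at url, whatever the status code."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        return True   # the server answered, just not with a 2xx status
    except OSError:
        return False  # connection refused, timeout, DNS failure, ...


if __name__ == "__main__":
    for name, url in [("Ollama", "http://localhost:11434"),
                      ("Open WebUI", "http://localhost:3000")]:
        print(f"{name}: {'up' if is_reachable(url) else 'not reachable'}")
```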
What you get
7. Use Ollama with Cursor / VS Code
To use a local Ollama model as your AI coding assistant in Cursor or VS Code, use the Continue extension, a popular open-source AI coding extension.
Install the Continue extension in VS Code or Cursor
Open the Extensions panel (Ctrl+Shift+X) and search for “Continue”. Install the extension by Continue.dev — it's free and open-source.
Open Continue settings — add a new model
Click the Continue icon in the sidebar, then open settings (gear icon). Click “Add Model”.
Set Provider: Ollama, Model: llama3.2
Select Ollama as the provider. Enter the model name you have installed (e.g. llama3.2 or mistral). The base URL defaults to http://localhost:11434.
Save — Continue now uses your local model
Continue replaces its cloud AI backend with your local Ollama model. You get code autocomplete, inline edits, and AI chat — all running locally, no API costs.
8. Tips for best performance
Tip 1: Use smaller quantized models
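A q4_K_M suffix in a model tag means 4-bit quantization: the weights take roughly half the memory of an 8-bit build and about a quarter of 16-bit weights, typically with only a small quality loss (around 5% as a rule of thumb). A great tradeoff for machines with less VRAM.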
Tip 2: Mac M-series — use your full unified memory
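Apple Silicon Macs use unified memory shared between CPU and GPU, and Ollama uses it automatically via Metal with no configuration (macOS keeps part of it for the system, so not every gigabyte is available to the model). An M2 Pro with 32GB can run 4-bit quantized 30B models comfortably. An M1 with 16GB is great for 7B–13B models.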
Tip 3: Windows with NVIDIA GPU
Ollama uses CUDA automatically if you have an NVIDIA GPU with compatible drivers installed. No configuration needed — Ollama detects the GPU and uses it for acceleration. GPU inference is typically 5–10x faster than CPU-only.
Tip 4: Keep models warm for faster responses
The first response takes 5–10 seconds as the model loads into memory. Subsequent responses in the same session are much faster since the model stays loaded. By default Ollama keeps a model in memory for 5 minutes after the last request; the keep_alive request parameter or the OLLAMA_KEEP_ALIVE environment variable changes that window.
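A model can also be loaded ahead of time so that the first real request is already fast. The sketch below posts to /api/generate with no prompt, which asks Ollama to load the model without generating text, and sets keep_alive to control how long it stays in memory. The 30-minute value and the function names are assumptions for illustration.

```python
import json
import urllib.request
from typing import Dict

OLLAMA_URL = "http://localhost:11434"


def preload_payload(model: str, keep_alive: str = "30m") -> Dict:
    """Request body that loads a model without generating any text.

    Leaving out the prompt means "load only"; keep_alive sets how long
    the model stays in memory after the request (e.g. "10m", "24h").
    """
    return {"model": model, "keep_alive": keep_alive}


def preload(model: str, keep_alive: str = "30m", base_url: str = OLLAMA_URL) -> None:
    """Warm up a model on a running Ollama server."""
    request = urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(preload_payload(model, keep_alive)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=120) as resp:
        resp.read()  # returns once the server has finished loading the model
```

Calling preload("llama3.2") before a batch of requests moves the load time out of the first response.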
Tip 5: Concurrent requests via the server
Ollama's REST server handles concurrent requests — you can call the API from an app while also using the CLI or Open WebUI simultaneously. Useful for building apps that use the API in the background.
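From an application, concurrent use usually means several requests in flight at once. The sketch below fans prompts out over a thread pool and returns the answers in the original order. The ask parameter is a hypothetical callable that sends one prompt to Ollama (for instance a thin wrapper around POST /api/generate); how many requests the server actually processes in parallel, rather than queues, depends on its settings and available memory.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def ask_all(prompts: List[str], ask: Callable[[str], str],
            max_workers: int = 4) -> List[str]:
    """Send several prompts at once and return the answers in prompt order.

    ask is any function that takes a prompt and returns the answer text.
    Threads suit this because each call mostly waits on the HTTP response.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(ask, prompts))
```

Passing the sender in as a parameter keeps the fan-out logic independent of the HTTP client and lets it be tested with a stub.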
FAQ
What is Ollama used for?
Ollama lets you run open-source LLMs locally on your Mac, PC, or Linux machine — no internet, no API costs, no data leaving your computer. Common uses: private AI assistant, offline coding helper (via Continue extension), experimenting with open-source models, building apps without sending data to cloud APIs.
Is Ollama free?
Yes. Ollama itself is completely free and open-source (MIT license). The models (Llama, Mistral, etc.) are also free to download, each under its own license, so check the terms before commercial use. The only cost is your hardware: you need enough RAM/VRAM to run the models.
What models can I run with Ollama?
100+ models including Llama 3.3 (Meta), Mistral, Phi-3 (Microsoft), Gemma 2 (Google), DeepSeek-R1, Qwen, CodeLlama, and more. See all models at ollama.com/library. Model size requirements vary — smaller models (1B–7B) run on most modern machines; 70B models need 40GB+ RAM or VRAM.
Can Ollama run on Windows?
Yes. Ollama supports Windows 10 and 11 with GPU acceleration via CUDA (NVIDIA GPU) or CPU inference. Apple Silicon Macs (M1–M4) and AMD GPUs on Linux are also well-supported. CPU-only inference is available on all platforms but is 5–10x slower than GPU.