
Ollama Guide: Run Local LLMs on Your Mac or PC (2025)

This 2025 guide covers installing Ollama on Mac, Windows, and Linux in 3 steps; running Llama 3.3, Mistral, Phi-3, and DeepSeek locally; the core CLI commands (pull, run, list, rm); the REST API at localhost:11434, including its OpenAI-compatible endpoints; the ChatGPT-like Open WebUI interface via Docker; and using local models in Cursor or VS Code through the Continue extension. Ollama is free, private, works offline, and has no API costs.

1. What is Ollama?

Ollama is a free, open-source tool that lets you run large language models locally on your Mac, Windows PC, or Linux machine. No internet connection required once models are downloaded — all processing stays on your device. Think of it as “Docker for LLMs” — simple CLI commands to pull and run models.

Models you can run with Ollama

Llama 3.3 / 3.2 / 3.1 — Meta's flagship open models
Mistral / Mixtral — great for coding and reasoning
Phi-3 / Phi-4 — Microsoft's fast small models
DeepSeek-R1 — strong reasoning model
Gemma 2 / CodeGemma — Google's open models
Qwen / CodeLlama — coding-focused models

100+ models available at ollama.com/library

Hardware requirements by model size

Model             VRAM (GPU)   RAM (CPU only)   Notes
Llama 3.2 1B      1 GB         4 GB             Runs on any modern machine
Llama 3.2 3B      2.5 GB       8 GB             Good quality for the size
Phi-3 Mini 4B     3 GB         8 GB             Microsoft's small model
Llama 3.1 8B      5 GB         8 GB             Good balance of quality and speed
Mistral 7B        5 GB         8 GB             Good for coding and reasoning
Llama 3.3 70B     40 GB        40 GB            Frontier-level quality
DeepSeek-R1 70B   40 GB        40 GB            Strong reasoning
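
The figures in the table roughly follow from parameter count times bits per weight. The snippet below is a back-of-the-envelope sketch, not an official Ollama formula. It assumes about 4.85 bits per weight for a 4-bit quantized build and a 1.5 GB headroom for the context cache and runtime buffers; both numbers are illustrative assumptions, and the function names are hypothetical.

```python
def estimate_weights_gb(params_billions: float, bits_per_weight: float = 4.85) -> float:
    """Rough size of the model weights in GB (weights only, no context cache)."""
    # One billion parameters at 8 bits each is about 1 GB.
    return params_billions * bits_per_weight / 8


def fits(params_billions: float, available_gb: float, headroom_gb: float = 1.5) -> bool:
    """True if the weights plus an assumed headroom fit in the given memory."""
    return estimate_weights_gb(params_billions) + headroom_gb <= available_gb


if __name__ == "__main__":
    for name, size in [("Llama 3.2 3B", 3), ("Llama 3.1 8B", 8), ("Llama 3.3 70B", 70)]:
        print(f"{name}: ~{estimate_weights_gb(size):.1f} GB of weights")
    print("8B on a 8 GB machine:", fits(8, 8))
    print("70B on a 16 GB machine:", fits(70, 16))
```

Treat the result as a lower bound: long prompts and large context windows push real usage higher.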

Use cases

Private local AI — no data leaves your machine
Offline development assistant — code anywhere, no internet needed
Running AI in airgapped or enterprise environments
Experimenting with open-source models for free — no API costs
Building apps with local inference — no cloud dependency

2. Install Ollama (3 steps)

Installation takes under 5 minutes on any platform. Ollama runs as a background service and starts automatically on login.

1

Go to ollama.com — click Download

Navigate to ollama.com and click the Download button. Installers are available for macOS (Apple Silicon + Intel), Windows 10/11, and Linux.

2

Run the installer and follow prompts

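On Mac: drag Ollama to Applications and launch it. On Windows: run the .exe installer. On Linux: run the install script shown on the download page (curl -fsSL https://ollama.com/install.sh | sh). On Mac and Windows, Ollama runs as a background service and an icon appears in your menu bar or system tray.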

macOS note: Ollama uses Metal for GPU inference on Apple Silicon (M1–M4). An M1 Mac with 16GB unified memory can run 7B–13B models smoothly, because the unified memory architecture means no separate VRAM is needed.

3

Verify in terminal: ollama --version

Open a terminal (Terminal on Mac, PowerShell or Command Prompt on Windows) and run ollama --version. You should see the version number, confirming Ollama is installed and on your PATH.
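
If you want a script to perform the same check, for example before launching an app that depends on Ollama, a minimal sketch could look like this. The helper name ollama_version is hypothetical; it only wraps the ollama --version command described above.

```python
import shutil
import subprocess
from typing import Optional


def ollama_version() -> Optional[str]:
    """Return the output of `ollama --version`, or None if ollama is not on PATH."""
    if shutil.which("ollama") is None:
        return None
    result = subprocess.run(
        ["ollama", "--version"], capture_output=True, text=True, timeout=10
    )
    # Fall back to stderr in case nothing was written to stdout.
    return (result.stdout or result.stderr).strip() or None


if __name__ == "__main__":
    version = ollama_version()
    print(version if version else "ollama not found on PATH")
```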

3. Your first model (4 steps)

Start with Llama 3.2 — it's around 2GB and works well on most machines. The first download takes a few minutes depending on your connection.

1

Pull a model

ollama pull llama3.2

Downloads the default Llama 3.2 model (~2GB). Progress shows in the terminal.

2

Run it

ollama run llama3.2

Starts an interactive chat session in your terminal. The model loads into memory (5–10 seconds on first run).

3

Type your message and press Enter

The model responds locally — no internet, no API key, no cost. You'll see the response stream in real time.

4

Exit the session

/bye

Type /bye or press Ctrl+D to exit the interactive session.

Other models to try

ollama run mistral       # Mistral 7B, good for coding
ollama run phi3          # Microsoft Phi-3 Mini, fast and small
ollama run llama3.3      # Llama 3.3 70B, requires 40GB+ VRAM/RAM
ollama run deepseek-r1   # DeepSeek R1, strong reasoning model
ollama run codegemma     # Google CodeGemma, coding-focused model

4. Useful Ollama CLI commands

These are the commands you'll use most often to manage models and the Ollama server.

ollama list              # see all downloaded models
ollama pull llama3.2     # download a model
ollama rm llama3.2       # delete a model
ollama show llama3.2     # model details (size, architecture, quantization)
ollama ps                # see running models
ollama serve             # start Ollama server manually (usually auto-starts)
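
The same model listing is available to programs through the server's /api/tags endpoint. The sketch below assumes the server is running at the default address; the helper names are hypothetical, and only the "name" and "size" fields of each entry are read.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default address used throughout this guide


def summarize_models(tags_response: dict) -> list[str]:
    """Turn an /api/tags response into 'name (size GB)' strings."""
    lines = []
    for model in tags_response.get("models", []):
        size_gb = model.get("size", 0) / 1e9  # size is reported in bytes
        lines.append(f"{model['name']} ({size_gb:.1f} GB)")
    return lines


def list_local_models(base_url: str = OLLAMA_URL) -> list[str]:
    """Fetch the locally downloaded models from a running Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=10) as resp:
        return summarize_models(json.load(resp))


if __name__ == "__main__":
    try:
        print("\n".join(list_local_models()) or "no models downloaded yet")
    except OSError as err:  # connection refused, timeout, HTTP errors
        print(f"Could not reach Ollama: {err}")
```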

5. Use Ollama via API

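Ollama runs a local REST API at http://localhost:11434. Its native endpoints live under /api (for example /api/generate), and it also exposes OpenAI-compatible endpoints under /v1. Code written for the OpenAI SDK can usually be pointed at Ollama by changing two values: the base URL and the API key.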

Generate endpoint (simple curl)

curl http://localhost:11434/api/generate -d '{"model": "llama3.2", "prompt": "Why is the sky blue?", "stream": false}'
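
For readers who prefer a script over curl, here is a rough Python equivalent of the request above using only the standard library. The function names are illustrative, not part of any Ollama client library.

```python
import json
import urllib.request


def build_generate_request(prompt: str, model: str = "llama3.2",
                           base_url: str = "http://localhost:11434") -> urllib.request.Request:
    """Build the same POST that the curl example sends to /api/generate."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "llama3.2") -> str:
    """Send the prompt and return the generated text."""
    with urllib.request.urlopen(build_generate_request(prompt, model), timeout=120) as resp:
        # With "stream": false the server replies with a single JSON object
        # whose "response" field holds the full completion.
        return json.load(resp)["response"]


if __name__ == "__main__":
    try:
        print(generate("Why is the sky blue?"))
    except OSError as err:  # connection refused, timeout, HTTP errors
        print(f"Could not reach Ollama: {err}")
```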

OpenAI-compatible chat endpoint (Python)

Use the OpenAI Python SDK — just change base_url and api_key. The rest of your code stays the same.

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # required param, value doesn't matter
)

response = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Explain recursion in simple terms"}]
)
print(response.choices[0].message.content)
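
The SDK call above waits for the complete reply. To print text as it is generated, one option is the native /api/chat endpoint, which streams newline-delimited JSON objects. The sketch below uses only the standard library; the helper names are hypothetical, and it reads just the message.content and done fields of each streamed object.

```python
import json
import urllib.request
from typing import Iterable, Iterator


def iter_chat_text(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from a streamed /api/chat response."""
    for raw in lines:
        if not raw.strip():
            continue  # skip blank lines between objects
        chunk = json.loads(raw)
        yield chunk.get("message", {}).get("content", "")
        if chunk.get("done"):
            break  # the final object is marked "done": true


def stream_chat(prompt: str, model: str = "llama3.2",
                base_url: str = "http://localhost:11434") -> Iterator[str]:
    """Stream a single-turn chat reply from a running Ollama server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }
    request = urllib.request.Request(
        f"{base_url}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=120) as resp:
        yield from iter_chat_text(resp)  # the response iterates line by line


if __name__ == "__main__":
    try:
        for piece in stream_chat("Explain recursion in simple terms"):
            print(piece, end="", flush=True)
        print()
    except OSError as err:  # connection refused, timeout, HTTP errors
        print(f"Could not reach Ollama: {err}")
```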

6. Open WebUI — ChatGPT-like interface for Ollama

Open WebUI is a free, open-source web interface for Ollama that looks and feels like ChatGPT. If you want a GUI instead of terminal chat, this is the best option.

Install with Docker (one command)

docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui --restart always \
  ghcr.io/open-webui/open-webui:main

Requires Docker (Docker Desktop on Mac and Windows, or Docker Engine on Linux). After the container starts, open http://localhost:3000 in your browser.

What you get

Create an account — then select any Ollama model you have installed
Chat like you would with ChatGPT — conversation history saved locally
Multiple models side-by-side for comparison
File upload and document Q&A
Web search integration (with configuration)

7. Use Ollama with Cursor / VS Code

To use a local Ollama model as your AI coding assistant in Cursor or VS Code, use the Continue extension, a popular open-source AI coding extension.

1

Install the Continue extension in VS Code or Cursor

Open the Extensions panel (Ctrl+Shift+X, or Cmd+Shift+X on Mac) and search for “Continue”. Install the extension by Continue.dev. It's free and open-source.

2

Open Continue settings — add a new model

Click the Continue icon in the sidebar, then open settings (gear icon). Click “Add Model”.

3

Set Provider: Ollama, Model: llama3.2

Select Ollama as the provider. Enter the model name you have installed (e.g. llama3.2 or mistral). The base URL defaults to http://localhost:11434.

4

Save — Continue now uses your local model

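Continue now sends its requests to your local Ollama model instead of a cloud provider. You get inline edits and AI chat running locally, with no API costs. Code autocomplete also works with local models, though depending on the Continue version it may need its own model entry in the settings.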

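Alternative: Cursor's model settings allow overriding the OpenAI base URL (Cursor Settings → Models). Ollama's OpenAI-compatible endpoint is http://localhost:11434/v1. Menu names vary by Cursor version, and if your version routes requests through Cursor's servers, a localhost address will not be reachable without exposing it through a tunnel.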

8. Tips for best performance

Tip 1: Use smaller quantized models

ollama pull llama3.2:3b-instruct-q4_K_M

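The q4_K_M suffix selects a 4-bit quantized build. It needs roughly half the memory of an 8-bit build, and far less than full 16-bit weights, at the cost of a small drop in output quality. That is a good tradeoff for machines with limited VRAM.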
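
To see where the savings come from, the sketch below compares approximate weight sizes across formats. The bits-per-weight values are assumed approximations (quantized formats store some scaling data alongside the weights), not figures published by Ollama, and the function names are hypothetical.

```python
# Approximate bits per weight for common formats (assumed values, not exact).
BITS_PER_WEIGHT = {"f16": 16.0, "q8_0": 8.5, "q4_K_M": 4.85}


def weights_gb(params_billions: float, quant: str) -> float:
    """Rough weight size in GB for a given parameter count and format."""
    return params_billions * BITS_PER_WEIGHT[quant] / 8


def savings_vs(params_billions: float, quant: str, baseline: str) -> float:
    """Fraction of memory saved by `quant` relative to `baseline`."""
    return 1 - weights_gb(params_billions, quant) / weights_gb(params_billions, baseline)


if __name__ == "__main__":
    for quant in BITS_PER_WEIGHT:
        print(f"3B model at {quant}: ~{weights_gb(3, quant):.1f} GB")
    print(f"q4_K_M vs q8_0: {savings_vs(3, 'q4_K_M', 'q8_0'):.0%} smaller")
    print(f"q4_K_M vs f16:  {savings_vs(3, 'q4_K_M', 'f16'):.0%} smaller")
```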

Tip 2: Mac M-series — use your full unified memory

Apple Silicon Macs use unified memory shared between CPU and GPU, and Ollama uses it automatically via Metal. macOS keeps part of that memory for the system, so leave some headroom when picking a model size. An M2 Pro with 32GB can run 30B models comfortably. An M1 with 16GB is great for 7B–13B models.

Tip 3: Windows with NVIDIA GPU

Ollama uses CUDA automatically if you have an NVIDIA GPU with compatible drivers installed. No configuration needed — Ollama detects the GPU and uses it for acceleration. GPU inference is typically 5–10x faster than CPU-only.

Tip 4: Keep models warm for faster responses

The first response takes 5–10 seconds as the model loads into memory. Subsequent responses in the same session are much faster since the model stays loaded. By default, Ollama keeps a model in memory for 5 minutes after the last request; the keep_alive request parameter or the OLLAMA_KEEP_ALIVE environment variable changes that.
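
One way to warm a model before you need it is to send a generate request with no prompt, which loads the model without producing text, together with a keep_alive value. The sketch below assumes the default server address; the 30-minute value and the function name are illustrative choices.

```python
import json
import urllib.request


def build_preload_request(model: str = "llama3.2", keep_alive: str = "30m",
                          base_url: str = "http://localhost:11434") -> urllib.request.Request:
    """Build a request that loads `model` and asks Ollama to keep it in memory."""
    # Omitting "prompt" makes the server load the model without generating text.
    payload = {"model": model, "keep_alive": keep_alive}
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


if __name__ == "__main__":
    try:
        with urllib.request.urlopen(build_preload_request(), timeout=120) as resp:
            print("Model loaded, HTTP status", resp.status)
    except OSError as err:  # connection refused, timeout, HTTP errors
        print(f"Could not reach Ollama: {err}")
```

Setting keep_alive to 0 does the opposite and unloads the model right after the request.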

Tip 5: Concurrent requests via the server

Ollama's REST server handles concurrent requests — you can call the API from an app while also using the CLI or Open WebUI simultaneously. Useful for building apps that use the API in the background.
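
As a sketch of what concurrent use from an app might look like, the snippet below sends several prompts from a thread pool. The helper names are hypothetical. Whether the server processes the requests in parallel or queues them depends on its configuration and available memory, so treat this as a client-side pattern only.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def ask_ollama(prompt: str, model: str = "llama3.2") -> str:
    """Send one non-streaming request to the local /api/generate endpoint."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    request = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=300) as resp:
        return json.load(resp)["response"]


def ask_all(prompts: list[str], ask: Callable[[str], str] = ask_ollama,
            workers: int = 4) -> list[str]:
    """Run `ask` for every prompt on a thread pool; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(ask, prompts))


if __name__ == "__main__":
    try:
        for answer in ask_all(["Define recursion.", "Define iteration."]):
            print(answer, "\n---")
    except OSError as err:  # connection refused, timeout, HTTP errors
        print(f"Could not reach Ollama: {err}")
```

Passing the ask function as a parameter keeps the pool logic testable without a running server.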


FAQ

What is Ollama used for?

Ollama lets you run open-source LLMs locally on your Mac, PC, or Linux machine — no internet, no API costs, no data leaving your computer. Common uses: private AI assistant, offline coding helper (via Continue extension), experimenting with open-source models, building apps without sending data to cloud APIs.

Is Ollama free?

Yes. Ollama itself is free and open-source (MIT license). The models (Llama, Mistral, etc.) are free to download, but each ships under its own license, so check the terms before commercial use. The only cost is your hardware: you need enough RAM/VRAM to run the models.

What models can I run with Ollama?

100+ models including Llama 3.3 (Meta), Mistral, Phi-3 (Microsoft), Gemma 2 (Google), DeepSeek-R1, Qwen, CodeLlama, and more. See all models at ollama.com/library. Model size requirements vary — smaller models (1B–7B) run on most modern machines; 70B models need 40GB+ RAM or VRAM.

Can Ollama run on Windows?

Yes. Ollama supports Windows 10 and 11, using CUDA for acceleration on NVIDIA GPUs and falling back to CPU inference otherwise. Apple Silicon Macs (M1–M4) and AMD GPUs on Linux are also well-supported. CPU-only inference works on all platforms but is typically 5–10x slower than GPU.