LM Studio Guide: Run Local AI Models With a GUI (2025)
LM Studio guide 2025 — download the desktop app (Mac/Windows/Linux), browse and install models via the built-in HuggingFace browser (Llama 3.3/Mistral/Phi-3/DeepSeek GGUF), chat with the built-in UI, configure temperature + context + GPU layers, and start a local OpenAI-compatible API server at localhost:1234. Free, private, no CLI needed.
1. What is LM Studio?
LM Studio is a free desktop GUI application for running large language models locally. Unlike Ollama (command-line), LM Studio has a visual interface for browsing, downloading, and chatting with models — no terminal commands required. Everything runs on your own hardware: no internet needed once models are downloaded, no API costs, no data leaving your machine.
Key features
Platform support
LM Studio vs Ollama at a glance
| | LM Studio | Ollama |
|---|---|---|
| Interface | GUI (visual) | CLI (terminal) |
| Model browser | ✓ Built-in HF browser | Manual pull commands |
| Chat UI | ✓ Built-in | Requires Open WebUI |
| API server | ✓ localhost:1234 | ✓ localhost:11434 |
| Easiest for | Non-technical users | Developers |
| Model format | GGUF | GGUF + Ollama native |
2. Install LM Studio (3 steps)
Installation is straightforward — LM Studio is a regular desktop application with no CLI setup required.
Go to lmstudio.ai and download for your platform
Navigate to lmstudio.ai and click Download. Choose your platform: Mac, Windows, or Linux. LM Studio's Mac builds target Apple Silicon, so check the site's system requirements before installing on an Intel Mac. The download is around 200–400MB.
Install and open LM Studio
On Mac: drag LM Studio to your Applications folder and double-click to open. On Windows: run the downloaded installer. No further configuration is needed; LM Studio opens directly as a desktop application.
On first launch: select GPU acceleration
LM Studio will detect your hardware and suggest acceleration settings. Choose: Metal for Apple Silicon, CUDA for NVIDIA GPU (Windows/Linux), or CPU for others. You can change this later in Settings.
3. Find and download a model (4 steps)
LM Studio's built-in Discover tab connects directly to HuggingFace — browse and download models without leaving the app. Choose a model that fits your hardware (see recommendations below).
Click the search icon (top left) or go to the Discover tab
The Discover tab is the magnifying glass icon in the left sidebar. This opens the built-in HuggingFace model browser.
Search for a model
Try searching: llama-3.2, mistral-7b, phi-3-mini, or deepseek-r1. Results come from HuggingFace and show available GGUF variants.
Filter by size and quantization
Use the filters to show only models that fit your VRAM or RAM. LM Studio can highlight which variants fit your hardware (shown in green). Filter by format: GGUF.
Click Download next to a specific variant
Choose a quantization level: Q4_K_M (4-bit, recommended — best balance of quality and size), Q5_K_M (slightly better quality, larger file), or Q8_0 (near-lossless, requires more RAM). Q4_K_M is the most popular choice.
Recommended models by hardware
| Hardware | Recommended models |
|---|---|
| 8GB RAM/VRAM | Llama 3.2 3B Q4, Phi-3 Mini 3.8B Q4 |
| 16GB RAM/VRAM | Llama 3.1 8B Q4, Mistral 7B Q4 |
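| 32GB+ (Mac M-series) | DeepSeek-R1 7B at higher precision (e.g. Q8_0), or larger mid-size models at Q4 |
| 64GB+ RAM | Llama 3.3 70B Q4 (about 40GB of weights; 16-bit precision would need roughly 140GB) |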
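
To check whether a given model and quantization level is likely to fit your machine, a rough size estimate helps. The sketch below multiplies the parameter count by an approximate bits-per-weight figure. The figures in `APPROX_BITS_PER_WEIGHT` and the helper name `estimate_weights_gb` are assumptions for illustration, not values reported by LM Studio, and real GGUF files vary by model. The result covers the weights only; context memory and runtime overhead come on top.

```python
# Approximate average bits per weight for common GGUF quantization levels.
# These are assumed, rounded figures; actual files differ by architecture.
APPROX_BITS_PER_WEIGHT = {
    "Q4_K_M": 4.8,
    "Q5_K_M": 5.7,
    "Q8_0": 8.5,
    "F16": 16.0,
}


def estimate_weights_gb(params_billions: float, quant: str) -> float:
    """Estimate the size of the model weights in gigabytes (10^9 bytes)."""
    bits = APPROX_BITS_PER_WEIGHT[quant]
    total_bytes = params_billions * 1e9 * bits / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    # Print a small grid of model sizes against quantization levels.
    for name, size_b in [("3B", 3), ("8B", 8), ("70B", 70)]:
        for quant in APPROX_BITS_PER_WEIGHT:
            print(f"{name} {quant}: ~{estimate_weights_gb(size_b, quant):.1f} GB")
```

Leave a few gigabytes of headroom beyond the estimate for the operating system, the context window, and LM Studio itself.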
4. Chat with a model
LM Studio includes a built-in chat interface — you don't need to install Open WebUI or any external tool. Just load a model and start typing.
Go to the Chat tab (speech bubble icon)
Click the chat/speech bubble icon in the left sidebar to open the chat interface.
Click “Select a model to load” → choose your downloaded model
A dropdown at the top of the chat shows all your downloaded models. Click it and select the model you want to use.
Wait for the model to load (5–30 seconds)
The model loads into memory on first use. A progress bar shows loading status. Larger models take longer; once loaded, subsequent responses are fast.
Type your message and press Enter
The model responds locally — streamed in real time. No internet required, no API costs, no data sent anywhere.
Optional: edit the System Prompt
At the top of the chat panel, you can set a System Prompt — instructions that define the model's behavior and role (e.g., “You are a helpful coding assistant. Be concise and show code examples.”)
5. Configure and optimize the model
While chatting, the right panel shows model configuration options. Adjusting these can significantly affect both quality and performance.
Context Length
The maximum number of tokens (system prompt, conversation history, and the reply being generated) the model can work with at once. Range: typically 4,096 to 32,768, depending on the model. Higher context uses more RAM. For most chat sessions, 4,096–8,192 is sufficient.
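To see why a longer context costs memory, it can help to estimate the key/value cache that a transformer keeps for every token in the window. The following is a minimal sketch, assuming a 16-bit cache and the published Llama 3.1 8B architecture values (32 layers, 8 KV heads, head dimension 128). The function name `kv_cache_bytes` is hypothetical, and LM Studio's actual memory use may differ, for example if the cache is quantized.

```python
def kv_cache_bytes(context_tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for `context_tokens` tokens."""
    # Factor of 2: one key vector and one value vector per layer and KV head.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens


# Assumed architecture values (Llama 3.1 8B); other models differ.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

if __name__ == "__main__":
    for ctx in (4096, 8192, 32768):
        gib = kv_cache_bytes(ctx, LAYERS, KV_HEADS, HEAD_DIM) / 2**30
        print(f"context {ctx:>6}: ~{gib:.2f} GiB of KV cache")
```

Under these assumptions the cache grows linearly with context length, so doubling the context doubles this part of the memory footprint.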
Temperature
Controls how creative or focused the model's responses are.
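As an illustration of what the temperature setting does, the sketch below applies the standard softmax-with-temperature formula to a few made-up token scores. The logits and the function name `softmax_with_temperature` are hypothetical; this shows the general sampling technique, not LM Studio's internal code.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Turn raw token scores into probabilities, scaled by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    example_logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
    for t in (0.2, 0.7, 1.5):
        probs = softmax_with_temperature(example_logits, t)
        print(f"temperature {t}: " + ", ".join(f"{p:.3f}" for p in probs))
```

Lower temperatures concentrate probability on the top-scoring token (more focused output), while higher temperatures flatten the distribution (more varied output).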
GPU Layers
How many model layers to offload to GPU. More layers = faster inference, but needs more VRAM. Set as high as your VRAM allows. LM Studio often auto-detects the right value based on your GPU.
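If you prefer to set GPU Layers by hand, a back-of-the-envelope estimate can give a starting value. This is a rough heuristic sketch, assuming all layers are about the same size and that a fixed amount of VRAM (`reserve_gb`, a guess) is held back for the context cache and other overhead. The function name `layers_that_fit` is hypothetical and this is not how LM Studio computes its own suggestion.

```python
import math


def layers_that_fit(model_gb: float, n_layers: int, vram_gb: float,
                    reserve_gb: float = 1.5) -> int:
    """Estimate how many model layers fit in VRAM after reserving headroom."""
    per_layer_gb = model_gb / n_layers          # assumes equally sized layers
    usable_gb = max(0.0, vram_gb - reserve_gb)  # headroom for cache and overhead
    return min(n_layers, math.floor(usable_gb / per_layer_gb))


if __name__ == "__main__":
    # Example: a ~4.8 GB model with 32 layers on GPUs of various sizes.
    for vram in (4.0, 6.0, 8.0):
        print(f"{vram:.0f} GB VRAM -> about {layers_that_fit(4.8, 32, vram)} layers")
```

Treat the result as a first guess: if loading fails or the system becomes sluggish, lower the value and try again.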
Threads
How many CPU threads to use for inference (relevant when running on CPU or with partial GPU offloading). A good default is half your CPU's logical core count. On Apple Silicon, most of the work runs on the GPU via Metal, so this setting matters less.
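The half-of-logical-cores rule of thumb can be computed for your own machine with a few lines. The helper name `suggested_threads` is hypothetical; it only reports a starting value to type into the Threads field.

```python
import os
from typing import Optional


def suggested_threads(logical_cores: Optional[int] = None) -> int:
    """Return half the logical core count, with a minimum of one thread."""
    if logical_cores is None:
        # os.cpu_count() can return None on unusual platforms; fall back to 2.
        logical_cores = os.cpu_count() or 2
    return max(1, logical_cores // 2)


if __name__ == "__main__":
    print(f"Suggested thread count: {suggested_threads()}")
```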
6. Use LM Studio as a local API server
LM Studio includes a local server that follows the OpenAI API format. Code that already calls OpenAI's chat completions endpoint can usually be pointed at LM Studio by changing the base URL and supplying a placeholder API key.
Click “Local Server” (left sidebar, server icon)
The server icon in the sidebar resembles a network or “</>” symbol. Click it to open the server panel.
Select your loaded model
Choose the model you want the server to serve. If you already loaded a model in Chat, it will appear here.
Click “Start Server” → server starts at http://localhost:1234
The green indicator confirms the server is running. The API base URL is http://localhost:1234/v1.
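Before wiring up an application, you may want to confirm the server is reachable from code. The sketch below uses only the Python standard library and assumes the server exposes the OpenAI-style `GET /v1/models` endpoint at the base URL above. The helper names `parse_model_ids` and `list_models` are hypothetical.

```python
import json
import urllib.error
import urllib.request

BASE_URL = "http://localhost:1234/v1"  # default LM Studio server address


def parse_model_ids(payload: dict) -> list[str]:
    """Pull model identifiers out of an OpenAI-style /models response."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_models(base_url: str = BASE_URL, timeout: float = 3.0) -> list[str]:
    """Return model ids reported by the server, or [] if it is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
            return parse_model_ids(json.load(resp))
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a response that is not valid JSON.
        return []


if __name__ == "__main__":
    models = list_models()
    if models:
        print("Server is up. Models:", ", ".join(models))
    else:
        print("No response. Is the LM Studio server started?")
```

An empty result usually means the server has not been started or is listening on a different port.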
OpenAI Python SDK — change just the base URL
Use the OpenAI Python SDK — change base_url to point at LM Studio. The rest of your existing OpenAI code stays identical.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:1234/v1",
    api_key="not-needed",  # required field, value doesn't matter
)
response = client.chat.completions.create(
    model="local-model",  # LM Studio ignores this and uses the loaded model
    messages=[{"role": "user", "content": "Write a Python function to calculate Fibonacci numbers"}],
)
print(response.choices[0].message.content)
curl alternative
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "local", "messages": [{"role": "user", "content": "Hello!"}]}'
7. LM Studio vs Ollama: when to use each
Both are free, private, and offline-capable. The right choice depends on your workflow and comfort with the command line.
Choose LM Studio if…
Choose Ollama if…
FAQ
What is LM Studio?
LM Studio is a free desktop application for running open-source LLMs locally on your Mac, Windows PC, or Linux machine. It provides a visual interface to browse and download models from HuggingFace, a built-in ChatGPT-like chat interface, and an OpenAI-compatible local API server at localhost:1234.
Is LM Studio free?
Yes. LM Studio is completely free to download and use. The models (Llama, Mistral, Phi, DeepSeek) are also free to download. The only cost is your hardware — you need enough RAM/VRAM to run the models you choose.
LM Studio vs Ollama — which should I use?
LM Studio is better for visual browsing and non-technical users — it has a built-in model browser, chat UI, and server without any terminal commands. Ollama is better for developers who want CLI integration, shell scripting, and use with tools like the Continue extension in VS Code. Both are free and provide a local OpenAI-compatible API.
What models can I run in LM Studio?
LM Studio supports GGUF-format models from HuggingFace, including Llama 3.3, Mistral, Phi-3, DeepSeek-R1, Gemma 2, Qwen, and hundreds more. Which of them you can run depends on your hardware: smaller quantized models (Q4_K_M, 4-bit) run on most modern machines with 8GB+ RAM.