r/LocalLLaMA

r/LocalLLaMA community 14d ago

vLLM has a new streaming parser for Qwen3+ available in nightly

The new parser reportedly fixes the issues many were seeing with Qwen3.6-27b stopping mid-turn, as well as streaming tool calls failing due to chunk boundaries. The mid-turn stopping is especially annoying when trying to use the model for agentic workflows. I've not seen it…

22
r/LocalLLaMA community 14d ago

"My son is a genius coder" - honest Alpha Tester review

"It's not slop - it's an art" - Grandma. Introducing you my few weeks brainstorming and writing of the code. I was rewrite everything my AI was creating so basically it's my own creation. Brain Calculator Pro™ — the calculator that made your calculator obsolete. AI-powered…

11
r/LocalLLaMA community 14d ago

Improving Neural Network Training by Decoupling the Magnitude and Direction of Weight Vectors | Alexander Hägele

This looks very promising in terms of simplifying and accelerating fine-tuning. submitted by /u/Thrumpwart

37
r/LocalLLaMA community 14d ago

Cheapest hardware for Qwen 3.6: both 27B and 35B-A3B

- "Qwen 3.6/3.5 27b > Qwen 3.6/3.5 35b > Gemma4 31b > Qwen 3.5 9b > Gemma4 12b > Gemma4 26b", people say
- "Qwen 3.6 for coding & Agentic, Gemma4 for human sounding text", people say

So I have been eyeing the RTX 3090 24 GB (or sometimes its cheaper Chinese companion…

30
r/LocalLLaMA community 14d ago

Finally - 4xRTX 5060TI

[Image: nvtop showing clocks and PCIe speed while running gpu_burn] I wrote a while ago about my plans to put together a quad 5060 Ti 16 GB based system after finding them nicely discounted. Everything got delayed due to issues with CPU seating (damn re-used stock cooler with plastic push…

32
r/LocalLLaMA community 14d ago

Reason to run local agents instead #645

submitted by /u/ToastFetish

18
r/LocalLLaMA community 14d ago

Stop using Ollama

submitted by /u/zxyzyxz

12
r/LocalLLaMA community 14d ago

Maybe dumb question, but how do you serve multiple users with the full context length?

After experimenting with llama.cpp, I'm wondering about something. Let's say we have an LLM with a context size of 128k. Now let's say we want to have up to 8 parallel users, and we want to provide each client with the full context capabilities. With llama.cpp, how does that work? AFAIK it…

20
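The arithmetic behind the question above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: "128k" is taken to mean 131,072 tokens, and the server is assumed to split its total context evenly across parallel slots, which is how llama.cpp's server has commonly behaved but may differ by version and cache settings. The function names are hypothetical and not part of any llama.cpp API.

```python
# Hypothetical helpers; these names are NOT part of llama.cpp.

def total_context_needed(per_user_ctx: int, parallel_users: int) -> int:
    """Total context the server must hold if every user gets a full window."""
    return per_user_ctx * parallel_users


def per_slot_context(total_ctx: int, parallel_slots: int) -> int:
    """Context each slot gets, ASSUMING an even split of the total."""
    return total_ctx // parallel_slots


PER_USER_CTX = 128 * 1024  # assumption: "128k" means 131,072 tokens
USERS = 8

# Total context required so that all 8 users get the full 128k each.
total = total_context_needed(PER_USER_CTX, USERS)
print(total)  # 1048576

# What each user would get if only 128k total were configured for 8 slots.
print(per_slot_context(PER_USER_CTX, USERS))  # 16384
```

Under these assumptions, giving each of 8 users the full window means sizing the total context for roughly one million tokens, with the KV-cache memory growing accordingly.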
r/LocalLLaMA community 14d ago

Local VibeCoding is a lot of fun..

Hi everyone! I don’t consider myself a professional, even though my current position is officially called "programmer." I’ve been writing code for many years, using different languages and technologies, most of which I’ve already forgotten) I decided to put together (to…

37
r/LocalLLaMA community 14d ago

We trained a cybersecurity-focused Mythos like LLM open weights on HuggingFace

We built OpenMythos for the Build Small Hackathon: an open-source LLM trained specifically for cybersecurity tasks. Wanted to share our training approach since the RLVR setup was non-trivial and might be interesting to people doing similar domain-specific fine-tuning. The problem…

7
r/LocalLLaMA community 14d ago

Evalatro: an open benchmark where LLMs play the real Balatro

Hey! I made Evalatro - an open benchmark where your LLMs play actual Balatro. Real game. It started because I kept asking Claude to help me beat levels while playing (yeah, I'm too weak). I'd just throw screenshots at it and ask for tactics. Then the idea grew into something…

21
r/LocalLLaMA community 14d ago

What do you guys think about Unsloth Studio?

As a person who has gone through more AI frontends than one goes through socks, I have really appreciated the Unsloth frontend. It's everything I could ever need and it supports Diffusion Gemma! It has easy options to enable tensor parallelism and much more. Have you guys tried it…

33
r/LocalLLaMA community 14d ago

I think we need a /LocalHarnessLLM or something ...

LM Studio, Hermes, Qwen Code, Odysseus, Open Claw, Open Code, Claude Code (and then IDEs w/ agentic capabilities): Continue, Rider, VS Code. And a dozen others I'm sure ... Would love a place to discuss these? If not a new subreddit, a new discord section in localllama discord? I've made…

24
r/LocalLLaMA community 14d ago

Local coding agents are good now, but only if you babysit them

Local coding agents are finally useful for me, but I still can’t just leave them alone. They are great for small fixes, reading a repo, changing files, and doing boring code work. But if I give them too much freedom, they start touching random stuff, making nice looking broken…

26
r/LocalLLaMA community 14d ago

I made a game where you convince an AI model that reality is a simulation.

Progress update: Showed you all my demo last week, had some great conversations with some very smart folk, and spent days fixing bugs and trying things out. And now, I humbly present to you: Simulation Simulator! A chat simulator game that bundles a local LLM inside Unity, and…

5
r/LocalLLaMA community 14d ago

About the Rio model

As a Brazilian, I was proud that a Brazilian team was able to bring innovation and a useful model to the table. What came next, with the wrong model uploaded, was a cold shower. There is a chance that it is real, and it would be a major improvement for local AI. I…

18
r/LocalLLaMA community 14d ago

Buying AI accelerators/GPUs in China...

Bit of a long shot, this, but it happens I'll be in China next week. Just wondering if there are any Chinese graphics cards/AI accelerators I should be trying to buy when I'm there? :-). I would be looking for something that lets me run inference on big models (so, lots of (V?)RAM), but…

10
r/LocalLLaMA community 14d ago

archex: local-first, deterministic code-context for AI agents — no API key, no telemetry (Apache 2.0)

archex turns a repo into a ranked, token-budgeted context bundle for coding agents: the symbols, imports, dependency-graph neighbors, and provenance the model needs, assembled before it reasons. It returns context, not an answer — your local model still does the thinking. The…

24
r/LocalLLaMA community 15d ago

React Native ExecuTorch now runs Gemma 4 (Vulkan and MLX accelerated)

We've integrated Gemma 4 into react-native-executorch. You can now run it fully offline in your React Native app, with GPU acceleration via the Vulkan delegate on Android and the MLX delegate on Apple Silicon. Link to the attached demo app here. submitted by …

32
r/LocalLLaMA community 15d ago

Tower-Plus-72B-Ultra-Uncensored-Heretic, a Model That Support 22 Languages Making it Great for Multilingual Tasks and is Especially Strong on Translation Related Workflows Where No Censorship Is Essential, Now Ultra Uncensored With 5/100 Refusals!

Safetensors: https://huggingface.co/llmfan46/Tower-Plus-72B-ultra-uncensored-heretic GGUFs: https://huggingface.co/llmfan46/Tower-Plus-72B-ultra-uncensored-heretic-GGUF Find all my models here: HuggingFace-LLMFan46 submitted by /u/LLMFan46

37
r/LocalLLaMA community 15d ago

I got tired of juggling OpenRouter + Artificial Analysis + Design Arena tabs to pick a model, so I put them in one filterable table

So every time I pick a model for a feature or some random use case I have, I end up having like 12 tabs open: usually OpenRouter for price and context, Artificial Analysis for benchmarks, Design Arena for the UI/frontend Elo if that's relevant, a status/model page for throughput or…

34
r/LocalLLaMA community 15d ago

Why there is a lack of new 100B-120B models?

GPT-OSS-120B was the first model of that family, which was followed by GLM-4.5-Air, Nemotron-3-Super, Qwen3.5-122B, Mistral-Small-4-119B. However, all models are at least 3 months old (10 months for GPT-OSS-120B) and all latest releases are either 25B-35B (Gemma4, Qwen3.6) or…

33
r/LocalLLaMA community 15d ago

People kept saying my comments sounded AI-generated, so I built this

https://preview.redd.it/bh8ar833gf7h1.png?width=970&format=png&auto=webp&s=a20831233fdd6b3243adc16d19101d81878f185b I originally came to Reddit because I wanted to discuss LLMs. More specifically, I wanted to talk about context management, long conversations, memory systems,…

26
r/LocalLLaMA community 15d ago

I'm still surprised on how good the kv quantization has become

https://preview.redd.it/78b1nuc63f7h1.png?width=1164&format=png&auto=webp&s=e4b7202b92026083d470e340260165ff8503ee57 https://preview.redd.it/ryl4v2ym3f7h1.png?width=1167&format=png&auto=webp&s=9e429648a3582dcf6ac12b5286b437e64889a3a9 kv at q4_0 (even the drafter is q4_0 kv) and…

24
r/LocalLLaMA community 15d ago

Context window + project size + Aider?

Forgive the naivety of this post, I'm a noob, bear with me! If a project, understood as a set of files, is larger than the context window of a model, how do you fit it in? After doing some naive research, various major LLMs like Deepseek, Kimi, and company say the solution is…

32
r/LocalLLaMA community 15d ago

This is amazing. Token speed doubled + kv cache now need low vram - qwen 27b

"Qwen3.6-27B Q4_K_M on a single RTX 3090: native 256K context at 38.6 tok/s with 72 MiB of resident KV, needle recall 88-100% at 6% residency, harness accuracy unchanged (36/36 vs full cache)." On the same hardware, generation speeds doubled and VRAM usage dropped significantly…

22
r/LocalLLaMA community 15d ago

What's the lesson chat?

submitted by /u/ill_be_productive

22
r/LocalLLaMA community 15d ago

An agent that plans with a frontier model but runs most of tokens locally (built it for my own dual-3090 rig)

For the past couple of months, I've been building a tool for my personal use. I have a dual RTX 3090 system which I wanted to use but the qwen 3.5/3.6 27B and Gemma 4 31B while being really good, just didn't have the taste or the ability that a frontier model has. OTOH, frontier…

38
r/LocalLLaMA community 15d ago

moar QAT stuff and hairy ticks

tldr; finally got to a point where we can publish some of the ggufs with a more accurate process. in these repos: https://huggingface.co/idkwhattoputherenow/gemma-4-12B-it-qat-q4_0-maxerr https://huggingface.co/idkwhattoputherenow/gemma-4-31B-it-qat-q4_0-maxerr this is a…

27
r/LocalLLaMA community 15d ago

UI/svg block rendering by ServeurpersoCom · Pull Request #24080 · ggml-org/llama.cpp

Watch the video to see SVG fun. submitted by /u/jacek2023

7
r/LocalLLaMA community 15d ago

I ported EXL3 to run well on Apple Silicon - PonyExl3

Hi guys, Beam here. After I revamped the chat interface in oMLX, I was playing with turboderp's exllamav3 on my RTX 4090 machine and wondered why I couldn't run this on my M5/M1 Max, so I built one. https://github.com/beamivalice/PonyExl3 For those who don't know Exl3 - it's one…

19
r/LocalLLaMA community 15d ago

100M model recommendation?

Looking for a model around 100M in size, and looking to see if things have improved since the last post on this topic from 2 years ago. submitted by /u/Ok-Internal9317

5
r/LocalLLaMA community 15d ago

Command A Plus GGUFs posted

Support for Command A Plus and North Mini Code was added to llama.cpp this weekend. Unsloth has North Mini Code GGUFs, but I didn’t find anyone with up-to-date GGUFs for Command A Plus, so I converted and quantized it! submitted by /u/coder543

12
r/LocalLLaMA community 15d ago

Made a macOS app that creates highly personal macOS apps. Works with models as small as Gemma 4 E2B

Apologies in advance as the video is demonstrating with GPT 5.4 mini (a local model would take too long for a video), however I’ve made the same app with Gemma 4 E4B. Been working on an open source project for a while called Ironsmith. The gist is you can create highly…

13
r/LocalLLaMA community 15d ago

Gemma 12b less than 10 watts 6.5pp 1.3tg

Google Pixel 10 Pro, Termux
llama.cpp version: 9639 (ef8268fee)
$ ./llama.cpp/build_vulkan/bin/llama-cli -m storage/downloads/gemma-4-12b-it-UD-Q3_K_XL.gguf --model-draft storage/downloads/mtp-gemma-4-12b-it.gguf --temp 1.0 --top-p 0.95 --top-k 64 --spec-type draft-mtp…

5
r/LocalLLaMA community 15d ago

EAGLE support merged into llama.cpp

submitted by /u/Diablo-D3

18
r/LocalLLaMA community 15d ago

Qwen 3.6 35B-A3B @ Q4 or Gemma 4 12B @ Q8?

Wondering how much model quantization matters here. Daily driver on my 32gb unified memory setup is the qwen model outputting ~15 tokens a second. Heard good things about the 12B Gemma 4 model so interested in trying it against my codebase. Given its size I can very comfortably…

28
r/LocalLLaMA community 15d ago

z.ai Poll on X: MIT-licensed open weights are losing

You can cast your vote here: https://x.com/ZixuanLi_/status/2065646648777416770#m Just to be clear: I am not urging or brigading anyone to vote specifically for MIT-licensed open weights. Please choose the option you genuinely prefer. I previously shared this in another post,…

27
r/LocalLLaMA community 15d ago

Nemotron - King of the Deep? Comparison of 4 models <=120B

Comparison was done on Strix Halo with 128 GB shared memory, Ubuntu 26.04, Lemonade Server, Vulkan backend. I often run larger models like gpt-oss 120B or qwen but their performance seems to degrade quickly once in deep waters... ah.. deep context. The most important quality to me is…

24
r/LocalLLaMA community 15d ago

Voice-to-voice chatbot update

I've been working on this after hours for a few months, continuously improving it. It is now at a point where the chatbot is close to real-time (thanks to SSE streaming) and also interruptible while preserving context of what was last said. 100% local and powered by Qwen3.5-397B…

33
r/LocalLLaMA community 15d ago

Gemma 4 models benchmarked on with Triple GPU

Hearing good things about Gemma 4. Ran a few models across my llama box. Kubuntu 26.04 OS. AMD Ryzen 5 3600 6-core CPU. 48 GiB of DDR4 3600 MHz RAM. Nvidia GTX 1070 with 8 GiB VRAM (x3), 24 GiB total VRAM. GPUs have power limit set to 120, 121, 122 watts using: sudo…

29
r/LocalLLaMA community 15d ago

Help with resources for using LLMs as fictional characters

Hey y'all, I'm an ex-cognitive scientist turned NLP Data Scientist by day, and science fiction author by night. I want to bring fictional characters in my prose to life with local LLMs, and I'm looking for the best resources out there for doing this kind of work (datasets,…

10
r/LocalLLaMA community 15d ago

Which is the better local mobile TTS: Kokoro or Supertonic?

I saw a few posts saying that Kokoro is better, but they both sound pretty good in their demos. How good are they in production, though? submitted by /u/Exact_Law_6489

27
r/LocalLLaMA community 15d ago

Anyone know how to turn off download images when compiling llama.cpp?

I noticed that the recent build environment for llama.cpp downloads various images during compilation for the UI, like "pwa-512x512.png". How can I turn this off? I already have "-DLLAMA_CURL=OFF". submitted by /u/fallingdowndizzyvr

6
r/LocalLLaMA community 15d ago

Strange numbers of pp and tg rx7900xtx on ROCm and Vulcan with Qwen3.6-27b nonMTP and MTP

So I'm getting very unsatisfactory results running this model locally.
Item: Current
OS: Ubuntu 24.04.4 LTS
Linux kernel: 6.8.0-124-generic
GPU: RX 7900 XTX / gfx1100
llama.cpp: b9630 / 8ed274ef4
ROCm: 7.2.4
AMD driver: 6.16.13
Vulkan: API 1.4.330, Mesa 26.0.0-devel
Raw Backend…

33
r/LocalLLaMA community 15d ago

Gemma 4 12B native encoder free voice input utilization suggest?

Hey everyone,  Like many of you, I’m looking into the newly released Gemma 4 12B to build a native speech-to-speech experience. Because of its unique encoder-free architecture, completely skipping the traditional STT bottleneck could be possible.  Right now, my…

20
r/LocalLLaMA community 15d ago

Nex claims Rio 3.5 is Nex 2.5 PRO in trench coat

submitted by /u/Specter_Origin

18
r/LocalLLaMA community 15d ago

Quality evaluation of quants with limited time or tokens

About a year ago, people were publishing a lot of benchmarks about various quants of models. I understand that this is not really feasible with the current (and otherwise welcome) frequent releases of new models, but on the other hand, it may still be useful to know locally whether q3…

36
r/LocalLLaMA community 15d ago

How to Run AI Locally: The Complete Beginner's Guide (2026)

Since local AI is booming and more people come and ask the same questions, I created a guide. submitted by /u/totosse17

9
r/LocalLLaMA community 15d ago

Qwen 27B Q6/Q8 KV + MTP at 256K on DGX Spark / GB10, tok/s?

Has anyone tested Qwen3.6-27B on NVIDIA DGX Spark / GB10 or similar systems at 256K context? I know it's a dense model, but I'm curious how it performs with MTP enabled. Looking for real numbers with: Q6/Q8 quant, Q8 KV cache, MTP/speculative decoding, 256K context. Mainly…

31
