r/LocalLLaMA


r/LocalLLaMA community 3d ago

Findings from troubleshooting p2p on 4x5060 ti bifurcation.

I spent the last week deep-diving into this. I've been using Linux for 14 years and am a cloud systems engineer focused on supported Linux infrastructure for a private cloud provider. Essentially, if you are using a single 4x4 bifurcation PCIe x16 card inserted into your…

5
r/LocalLLaMA community 3d ago

How to distill my own models?

I've been using cloud-provided models for agentic theorem proving a lot, and cost is becoming an issue for me. I have funding for hardware costs but can't use it for LLM credits, which puts me in a unique situation where it might be cheaper to self-host models instead of paying…

29
r/LocalLLaMA community 3d ago

Took the plunge! (Minisforum MS-S1 Max)

With Apple prices entering the stratosphere, the recent Fable gov't rug pull, and the inevitable closed-model price increases, I decided to pick up a (lightly) used Minisforum MS-S1 Max with 128GB of memory. Comes with a 10-day return and a 3-month warranty. Paid the local equiv…

30
r/LocalLLaMA community 3d ago

Can Qwen3.6-35B-A3B on an RTX 3060 Replace Google Vision for Receipt-to-JSON Extraction?

I tried replacing Google Vision in my receipt pipeline with a local Qwen model. I had an old LINE message bot where I could send a receipt photo, it would go to Google Vision, get parsed into JSON, and saved in SQLite. Recently I tried again, but locally. Setup: RTX 3060 12GB…

8
r/LocalLLaMA community 3d ago

Upgraded my budget build to multi-GPU for inference

I added: 1x RTX 3090 (610 USD); 1x Arc A770 (222 USD); 1x PCIe x1 to 4x USB 3.0 PCIe riser; new CPU cooler. Specs: modified Zalman Z9 Plus case; 2x Zotac RTX 3090 24 GB; 1x Intel Arc A770 16 GB; 48 GB DDR4 RAM; AMD Ryzen 5 1600X; MSI X370 SLI Plus. All parts were purchased second hand…

37
r/LocalLLaMA community 3d ago

Nemotron-3-Super-120B-A12B (hybrid Mamba+MoE) holds perfect needle retrieval to 504K tokens on 4×3090

TLDR: The Mamba/SSM layers keep a constant-size recurrent state instead of a growing KV cache, so context is nearly free. Full needle retrieval at half a million tokens, fully on-GPU, ~71GB. The new imatrix gguf here…

29
r/LocalLLaMA community 3d ago

vulkan: make TP viable by pwilkin · Pull Request #25051 · ggml-org/llama.cpp

The legend Piotr has taken a pass at making Vulkan Tensor Parallel somewhat usable; really looking forward to seeing this evolve. Submitted by /u/TKGaming_11

11
r/LocalLLaMA community 3d ago

Hello there! (again) i ported my kokoro enhancements so you can use them in your projects.

i made a web based and python based version of the enhancements i made to kokoro's controls. both are, of course, fully client side. if you have hardware acceleration turned on in your browser, kokoro runs on webgpu at about 40ms per generation. it's really fast. note: the…

36
r/LocalLLaMA community 3d ago

Local LLM Peeps

I am 80% done with a harness that works for local and API but is local first. The harness has some interesting logic around multiple agents which I’m holding back on until it is open source on GitHub. I have been local for 6 months and built out EVERYTHING I could think of to…

28
r/LocalLLaMA community 3d ago

What are people using for multi-model backends? What about swapping configs?

I am trying to plan and deploy a machine that serves models for coding, Hermes, and whatever else. It's got multiple GPUs in it, and I want the flexibility to run different configurations (i.e. I might want to run two smaller models when I'm using Hermes and doing some…

23
r/LocalLLaMA community 3d ago

"What should I do?" - consider post-training

This is in response to the common post where OP has acquired some cool hardware and is wondering what to do with it. The standard response is always (1) download model X, (2) benchmark it on tps, (3) share screenshots. I argue this is boring and intellectually lazy, and propose…

18
r/LocalLLaMA community 3d ago

Streaming medical STT running locally on a MacBook

Quick teaser of what I’ve been working on over the last few weeks: a streaming medical speech-to-text model that runs fully on-device. This demo is running locally on a MacBook through MLX. Still doing more evals, but planning to release the open weights next week.  …

22
r/LocalLLaMA community 3d ago

Book Review: Domain-Specific Small Language Models by Guglielmo Iozzia

Domain-Specific Small Language Models, by Guglielmo Iozzia. Review by u/skiata. I came across Domain-Specific Small Language Models (https://www.manning.com/books/domain-specific-small-language-models) by attending the author's talk at an ACM Tech Talk (…

19
r/LocalLLaMA community 3d ago

Getting real work out of a 4B local model: the distill-on-idle pipeline behind an on-device "memory" assistant

https://preview.redd.it/iiiqwt96tn9h1.png?width=3004&format=png&auto=webp&s=f02fba9f64e27ac91b2ae4cd478842106b294366 https://preview.redd.it/47cb5u96tn9h1.png?width=3024&format=png&auto=webp&s=b1cee93477970b8b0a636c37be657fecd38ba968…

7
r/LocalLLaMA community 3d ago

Why do people keep investing in Intel for AI?

If you get a good deal on some Xeons with a lot of memory bandwidth, or a cheap GPU for home inference, that's cool, no disrespect. But how in the hell are Wall Street types considering Intel part of the "AI picks and shovels" play? Who's buying Intel for their AI data centers?…

17
r/LocalLLaMA community 3d ago

8 Tesla T4 Cards, what should it do?

I have collected 8 Tesla T4 datacenter cards from a few retired VDI servers. I have one in a DEG1 and it works OK on its own. What should we do with the rest? Submitted by /u/imonlysmarterthanyou

7
r/LocalLLaMA community 3d ago

What's one local AI workflow you wish you'd discovered sooner?

There are a lot of posts about the models and benchmarks, but I am more interested in the workflows that people use. What is one workflow that really saved you time or made your local LLM more useful? It could be anything—RAG, MCP, coding agents, organizing prompt, document…

23
r/LocalLLaMA community 3d ago

Planning small AI RIG, 5 X 5060ti 16GB, after selling my 5090

Tell me if it's a good idea or not. I have a Zotac Solid 5090 with 128GB RAM, and I'm thinking of selling only the 5090 and getting 5x 5060 Ti 16GB, connected with these PCIe 4.0 x16 extender riser cables. I'm planning an open rig for AI; is it a good idea? Submitted by /u/Specialist_Pea_4711…

30
r/LocalLLaMA community 3d ago

Gemma 4 12b needs glasses

Having a lot of fun using Gemma 4 as an assistant, but I'm growing frustrated with the poor default image resolution setting for vision. Tasks like identifying smaller text in an image, which Qwen 3.6 flies through, Gemma 4 is never able to decipher. Even larger overall…

31
r/LocalLLaMA community 3d ago

Combined RTX5080 & 4060 for inference ?

Hey, I currently use my RTX 4060 8G for inference with Qwen 3.6-35B-A3B Q8 (Q8 for everything: weights, values, keys), max 60k context per agent (for quality over speed, with CPU & DDR4 offloading), but I only get ~100 pp & 20 tg at most when context is still low on Qwen 3.6-35B-A3B Q8,…

38
r/LocalLLaMA community 3d ago

Made an interactive explainer about speculative decoding/MTP

Submitted by /u/undefdev

36
r/LocalLLaMA community 3d ago

Help optimizing llama.cpp + Qwen 27B on RTX PRO 6000 Blackwell for coding agents

Our company recently acquired a workstation with an RTX PRO 6000 Blackwell, and we're experimenting with local LLMs to reduce part of our Claude token usage. Right now we're running Qwen3.6 27B MTP Q8_K_XL with llama.cpp on Windows 11. I've been using both Claude Opus and…

13
r/LocalLLaMA community 4d ago

KLD is flawed in abliteration.

I've noticed while creating my abliteration engine that KL is a flawed metric because it can be represented so many different ways, it depends completely on eval prompts, and lots of people use first token KL to make their models appear better than others. So I'm curious what do…

10
r/LocalLLaMA community 4d ago

Anyone tried Ornith-1.0 9B?

Should I even give it a chance over "qwopus3.5 9b v3.5" or "qwopus3.5 9b coder"? Anyone tried it? Submitted by /u/BothYou243

8
r/LocalLLaMA community 4d ago

Ornith 1.0 - terminology and concepts explained (basic)

I made a quick guide for myself while wanting to try the new models, so I share it with you. It's pretty basic, but it may be useful for new people here. I also published the repo with the open code config and the commands: https://github.com/facuHannoch/AI_Workflows-Ornith-1.0…

34
r/LocalLLaMA community 4d ago

Does llama cpp split mode tensor cause issues?

I split qwen 27b and Gemma 4 26b (moe) across a 5080, and 2x 5060ti. I noticed setting split mode to tensor mode will cause looping issues in OpenCode with tool calls or just through the reasoning traces. Anyone else get this or understand why? Split mode layer seems to work…

25
r/LocalLLaMA community 4d ago

For dual GPUs, will there be any big impact to inference speeds when running in PCIe 5.0 x8/x4 vs x8/x8?

I bought the Biostar Z890 Valkyrie because it was on sale and had three PCIe 5.0 slots connected to the CPU (x16 or x8/x8 or x8/x4/x4), which I thought would be great for running dual GPUs for LLM inference. The problem is that now I want to add a SATA expansion card to the…

25
r/LocalLLaMA community 4d ago

When you don't have a data center GPU

Please don't tell me someone is going to (yet again) reply with the longest finetune-merge name in eternity... Submitted by /u/Iwaku_Real

4
r/LocalLLaMA community 4d ago

Good YouTube channels for local LLM news and development?

Sometimes I'd prefer chilling on the couch and learning instead of reading. I've searched on YouTube and most seem like clickbait and slop. Thanks. Submitted by /u/6jarjar6

5
r/LocalLLaMA community 4d ago

Stop waiting for Qwen3.7 Openweights.

Ornith-1.0, a family of open-source LLMs specialized for agentic coding. Ornith-1.0 spans the full parameter sizes, including 9B Dense, 35B MoE, and 397B MoE. It achieves state-of-the-art performance among open-source models of comparable size on coding benchmarks. Hugging Face:…

36
r/LocalLLaMA community 4d ago

Built an open source local first Kanban workflow for running AI coding agents without babysitting every step

I’ve been building BatonBot, a local first app for running AI coding workflows with less babysitting. The problem I kept running into, especially with local models, is that coding agents can be useful but the workflow gets slow: start task → wait → check output → fix next issue…

10
r/LocalLLaMA community 4d ago

audio.cpp: 12 audio models (Qwen3-TTS, PocketTTS, VeVo2 etc) in 1 C++/ggml runtime — TTS up to 5x faster than Python on CUDA

I’ve been working on audio.cpp, a native C++ inference framework for audio models built on top of ggml. The framework currently has 25 model families, but I want to be precise about its state: 12 are released in the repo now and ready for normal use. I’m not counting anything…

24
r/LocalLLaMA community 4d ago

US Govt to individually approve who gets GPT 5.6.

Submitted by /u/AtlanticHM

16
r/LocalLLaMA community 4d ago

[Research] JetSpec: Speculative Decoding with Parallel Tree Drafting Enables up to 9.64x Lossless LLM Inference Speedup with more than 1000TPS

We find speculative decoding can push LLM generation latency to the extreme by co-optimizing drafting cost and drafting quality with causal parallel tree drafting. JetSpec reaches up to 9.64× end-to-end speedup on MATH-500 and 4.58× on open-ended chat while remaining lossless. With…

12
r/LocalLLaMA community 4d ago

Qwen 3.6 27b GLM 5.2 fine-tune?

Hi everyone, since both models are open weights and GLM seems to have found the secret to frontier model reasoning, why don't we see any Qwen GLM finetune yet? Is it because GLM 5.2 is recent and finetunes and datasets take time, or is the community just not interested in the finetune?…

28
r/LocalLLaMA community 4d ago

How I'm handling per-agent isolation and environment lifecycle in a harness-agnostic orchestration library

This is my third post about designing an orchestration library for agents. I want to share the architecture decisions as I go and to put a solution out there in case you have the same problem, but also to hear what you think. Agent's environment: workspace, runtime, and…

27
r/LocalLLaMA community 4d ago

DGX Spark OS lifetime?

I'm thinking of purchasing 2 DGX Sparks for my office (because a 700+W workstation would be intolerable) for LLM-centric work (inference only, no fine-tuning). I know the OS is based on Ubuntu 24.04. Has Nvidia ever disclosed what the lifetime of the OS is? Meaning, is there a chance…

17
r/LocalLLaMA community 4d ago

Prices of graphic cards are going crazy, should I buy a second card though?

A few months ago, I bought an RX 7900 XTX 24GB for 900€ new to start toying with local LLMs. Little did I know that I would now want to add a second card to my rig, but prices have gone insane! Adding a 7900 XTX would cost me 1200€ new now, the used price is around 900€, and the last…

38
r/LocalLLaMA community 4d ago

3090 Idle Consumption reset

I was about to say how much better 595.71.05 drivers were at reverting my dual 3090s to a lower power state when idle, with my 3090s dropping down to 13w to 15w. Today, however, one of the cards seems stuck at 24w to 30w with zero activity and fan at 0%. I've stopped all running…

16
r/LocalLLaMA community 4d ago

LFM2.5 230M running in-browser at 1,400 tok/s using custom WebGPU kernels

Everything runs locally in your browser using custom WebGPU kernels written by Fable 5 (before it was shut down) and Opus 4.8. The video was recorded on my M4 Max. Model: LiquidAI/LFM2.5-230M (GGUF). Demo: https://huggingface.co/spaces/webml-community/lfm2-webgpu-kernels …

37
r/LocalLLaMA community 4d ago

Report: Apple to skip M6 Pro/Max chips, fast-track M7 for local AI

Submitted by /u/fallingdowndizzyvr

22
r/LocalLLaMA community 4d ago

Fast medical RAG API to give your local LLMs access to facts

I created a simple RAG API using medical Wikipedia articles that you can point your agent to and use freely. It may be useful in allowing your local LLMs access to medical facts they might not be able to recall from their weights. I'm aiming for subsecond responses but cannot…

7
r/LocalLLaMA community 4d ago

Which model for technical documentation?

Looking to create high-level / low-level software designs based on existing templates/examples, cross-reference code, use MCP to download Confluence/Jira data, and also plug into agentic ‘coding’ frameworks like opencode. I mostly use opus 3.6 with Kiro-cli, but I want my data…

32
r/LocalLLaMA community 4d ago

rtx 6000 pro owners, do you regret?

I found the last dealership in my area that has the RTX 6000 Pro available. I already wanted to buy it 6 months ago when it was around $8k; now prices have increased to $13k-ish. Regardless of the price, are you happy with it? I assume you are using qwen3.6 27b; is it worth it? Please share…

9
r/LocalLLaMA community 4d ago

GLM 5.2 on consumer hardware

I tried out the unsloth quants of GLM 5.2 on still "consumer-ish" hardware: 32C Zen5 Threadripper Pro 9975 WX, Asus WRX90E-SAGE-SE PCIe Gen5, 512GB DDR5 ECC RAM @ 4800MHz, dual RTX 5090. This machine was put together pre-RAMpocalypse, and back then not exceedingly expensive…

19
r/LocalLLaMA community 4d ago

Tensor Split Fix for intel GPU's llama.cpp release b9788

sycl : support --split-mode tensor #24152. I'd like to see some numbers if anyone has 2x Intel GPUs and tries this out. Submitted by /u/Bulky-Priority6824

10
r/LocalLLaMA community 4d ago

Ornith-1.0 released on Hugging Face

Including 9B Dense, 31B Dense, 35B MoE, and 397B MoE, and reporting SOTA on different benchmarks (let's see if this holds). https://huggingface.co/collections/deepreinforce-ai/ornith-10 Submitted by /u/paf1138

26
r/LocalLLaMA community 4d ago

It turns out Bash is All You Need to write a language model REPL (and jq and curl)

While working on a self-educational exercise tinkering with local models and trying my hand at setting up agents, I went down a rabbit hole: to see how far I could build a custom agent REPL loop using exclusively command-line building blocks and stripping out dependencies…

20
r/LocalLLaMA community 4d ago

New Apple Memory Prices

https://preview.redd.it/00o5xtaznf9h1.png?width=696&format=png&auto=webp&s=60a3306ea86a9b0d1f58c435b7dbb0a42761a415 Apple raised the prices across the product line this morning:…

33
r/LocalLLaMA community 4d ago

siq1 on kebab bench

Tested my model on kebab bench and it performs very well: https://huggingface.co/spaces/AlexWortega/hermes-agent-zerogpu Submitted by /u/Mysterious_Hearing14

29
