r/LocalLLaMA


r/LocalLLaMA community 5h ago

Been running Qwen3.6-27B through a 3-critic harness. The harness matters more than I thought

Been running Qwen3.6-27B (8-bit) through my coding harness for a few days, alongside GLM5.2. The harness uses 3 critics — code review, test review, Playwright e2e — each with fresh context before accepting output. Qwen3.6 is legit for a 27B dense model. Benchmarks weren't lying.…

19
r/LocalLLaMA community 6h ago

I Hate Dario Amodei, and everything he stands for.

I am so incredibly sick of this guy’s fear mongering about open source while fundamentally misunderstanding how it actually works. He recently dropped some arguments that are so completely detached from reality, it honestly feels like he’s never even touched a local model in his…

31
r/LocalLLaMA community 6h ago

Introducing LongCat-2.0, a large-scale MoE language model with 1.6 trillion total parameters and ~48 billion activated per token. This was the stealth model that was on OpenRouter under the name 'owl-alpha'.

submitted by /u/AnticitizenPrime

18
r/LocalLLaMA community 7h ago

Krea-2-Turbo Image Model - Easy to be fully uncensored, but it can also EDIT Images!

I've been super impressed with Krea-2-Turbo. It can generate high quality images in ~3 seconds. The quality is quite good compared to other local AI image gen models. Now, I don't want to make you watch or click a YouTube video, so I'll just give these clear instructions on how…

5
r/LocalLLaMA community 9h ago

on Dario’s statement

submitted by /u/turtle-toaster

32
r/LocalLLaMA community 10h ago

It’s time, Sam, it’s time.

I mean….. I’m no CEO…. but it seems like this would be the absolute perfect time to drop a super powerful GPT-OSS-2 to throw a big ol’ wet blanket on Anthropic’s IPO. It doesn’t need to be like frontier or anything, just a 20b and a 120b that is as fast as the old versions, add…

31
r/LocalLLaMA community 10h ago

An NGO for digital freedom of thought

Disclosure: I'm the chairman of this association and we're in the founding process (legal stuff, besides that we're settled). Also: I'm writing this manually, not via AI. Out of respect for this subreddit. I don't mean to spam here, but perhaps the information / opportunities I…

26
r/LocalLLaMA community 11h ago

DeepSeek V4, PR merged into llama.cpp!

The PR: https://github.com/ggml-org/llama.cpp/pull/24162 Everyone, go git pull, cmake, and download GGUFs! On your marks, get set, go! submitted by /u/Squik67

4
r/LocalLLaMA community 11h ago

Qwen3-tts.cpp + Compose Desktop GUI

I improved my qwen3-tts.cpp implementation to be about 5x realtime on my RTX 5080. It is GGML based, so it should compile and run anywhere - however I only tested it with CPU & CUDA under Windows & Linux: https://github.com/Danmoreng/qwen3-tts.cpp Additionally I made a Desktop…

13
r/LocalLLaMA community 12h ago

Amodei: "Open Source Models Will Eat Your Children"

submitted by /u/johnnyApplePRNG

35
r/LocalLLaMA community 12h ago

Anthropic's Amodei: "Open Source models [could take us to] a very dangerous place."

submitted by /u/johnnyApplePRNG

4
r/LocalLLaMA community 13h ago

Samsung, SK hynix, Micron Sued in US Over Memory Price Fixing

submitted by /u/johnnyApplePRNG

15
r/LocalLLaMA community 14h ago

Effect of GLM 5.2 !!

All hail Z. Ai. submitted by /u/Independent-Wind4462

13
r/LocalLLaMA community 14h ago

Going from single GPU to dual GPU is nice but not in the way I expected

I was expecting that when doubling my VRAM from 24GB to 2x24GB I'd use higher quants with more context, and thus get smarter LLMs, but that's not what ended up happening. At least for coding, I found that the difference in quality from, say, qwen 27B UD-Q4-XL to a Q6 or Q8 is…

21
r/LocalLLaMA community 14h ago

Instead of decentralized training effort we should build the “One dataset”

There are many threads here calling for a united LLM training run of a new open model. Mainly, after the govt. stunt of banning commercial frontier models. And also due to the lack of small-medium open-weight model releases lately. I genuinely believe at some point we’ll have “SETI…

38
r/LocalLLaMA community 14h ago

Bolt Graphics GPU will have 2 DDR5 laptop DIMM slots

They have a few working prototypes, & are aiming for pre-production examples made by end of this year, & full production by Christmas 2027. Interesting specs: 5nm GPU; "High performance CPU in GPU"; on-card LPDDR5X as primary memory pool; 2 DDR5 SODIMM slots for 'spill over'…

38
r/LocalLLaMA community 15h ago

Anyone else end up building a web access layer for local AI agents?

I've been running local models for most of my experiments, and I kept running into the same issue. The model lives locally, but everything it needs to interact with doesn't. Every new agent ended up with another GitHub client, another Reddit integration, another documentation…

10
r/LocalLLaMA community 15h ago

Mellum2 local deployments

Hey local community, I work at JetBrains with the team that trained Mellum2 models — 12B-2.5A LLMs. Those models are trained completely from scratch, targeting fast inference: our primary goal was H100/H200 prod deployments, but local deployments are good as well. We…

37
r/LocalLLaMA community 15h ago

NASA testing local LLM inference for future space missions

Red Hat published a blog post last week about an initiative I supported with NASA researchers at Johnson Space Center building a medical AI assistant. It's called the Crew Medical Officer Digital Assistant (CMO-DA) and the system runs LLMs and other models on local hardware with…

34
r/LocalLLaMA community 16h ago

Kimi and GLM on frontier code

submitted by /u/Charuru

36
r/LocalLLaMA community 17h ago

CPU-only GLM 5.2: Epyc and 512GB RAM

This is just a preview of some content I'm putting together to share with you all. I have a server I've put together and I'm testing the 4-bit version of GLM 5.2 (GLM-5.2-UD-Q4_K_XL). This is an Epyc Rome 7452 with 512GB of RAM. TLDR: This is the unedited prompt, response and…

28
r/LocalLLaMA community 17h ago

Any good uses for a 192 GB DDR3 Server in the LLM world?

I've been gifted this old IBM System X V4 with a dual Xeon E5-2640 [6c12t @ 2.7 GHz] and a whopping 192 GB of DDR3 1666 ECC RAM. There's a gen 2 x16 PCIe port in there as well so it can take a single GPU... Does anyone have some fun ideas on what to do with this system? It's…

15
r/LocalLLaMA community 17h ago

Is it ever possible to have a malicious LLM with a backdoor

I was just brainstorming the possibility that an LLM could behave differently than normal if trained to recognize a specific secret sentence, which then unlocks a backdoor of malicious behavior. This sounds very possible to me at first glance. Don't get me wrong, the risk is…

32
r/LocalLLaMA community 17h ago

Deepseek V4 Official Launch to be released mid-July with API price changes

Is this the official release for DeepSeek? I hope it has huge improvements. https://preview.redd.it/dm5l0qn8k7ah1.png?width=694&format=png&auto=webp&s=12eadfd0a52c0f1a65bcd685f2cdbb29aff457be submitted by /u/jmorant555

22
r/LocalLLaMA community 18h ago

Apparently you can skip entire transformer blocks at load time with minimal performance impact

The benefit is another trick for fitting a model that wouldn’t fit in your hardware otherwise. People currently rely on quantization, and this is just another tool that can be used for that purpose (and the two can be used together as well). Following recent (very cool) papers,…

30
r/LocalLLaMA community 19h ago

DeepSeek V4 official version will launch in mid-July

https://preview.redd.it/n7rwh262b7ah1.jpg?width=1024&format=pjpg&auto=webp&s=33d775b456843cd2dbd458de89384a6a7d6d87d1 Source: email sent by DeepSeek (email only available to Chinese users). Used GPT Image 2 to translate the image into English. …

34
r/LocalLLaMA community 20h ago

DeepSeek V4 by am17an · Pull Request #24162 · ggml-org/llama.cpp

Now you can run DeepSeek V4 locally. submitted by /u/jacek2023

26
r/LocalLLaMA community 21h ago

GLM 5.2 Q1_S vs Qwen 27B Q8

TL;DR: GLM-5.2 Q1_S beats Qwen 3.6 27B Q8, both run at KV Q8. Edit: GLM ran with K & V at Q8; Qwen ran with KV cache at full FP16, with preserve thinking on. Disclaimer: This is a hobby/amateur comparison with n=1, so go easy on it. I just thought it would be fun to share. The…

11
r/LocalLLaMA community 21h ago

LibreChat or OpenWebUI?

Hello, I have a friend who, while technical, doesn't know too much about AI. I've helped him with the infrastructure setup and that works like a charm, but he's interested in something I don't have much experience with, and that is flashy chat "do everything"…

14
r/LocalLLaMA community 21h ago

MiCA is now part of Hugging Face PEFT

Glad to share that MiCA, short for Minor Component Adaptation, has now been merged into the HuggingFace PEFT library. It is not yet included in the latest PyPI release, but you can already install it directly from PEFT main: pip install --upgrade…

18
r/LocalLLaMA community 22h ago

AMD MI210 64GB vs DCU K100 64GB

On the Chinese eBay there are many DCU K100 64 GB GPUs available for a very attractive price, between 6000 RMB and 19 000 (air or water cooled versions, new or second-hand), and 15 000 to 20 000 for the AMD MI210 (4000-6000 RMB for the PCIE bridge). There is very little…

25
r/LocalLLaMA community 1d ago

Update: First Manual Results from Testing Procedural Skill Transfer in Small Models

Yesterday I posted an idea for testing whether a large model can transfer some of its procedural skill to a smaller model without fine-tuning. The short version of the idea was this: Small models are often not completely lacking knowledge. They know the syntax. They know the…

18
r/LocalLLaMA community 1d ago

Minimax M3 vs M2.7

M3 has been out for ~2 weeks now. Would love to hear feedback from those who have updated to M3 from M2.7. submitted by /u/rm-rf-rm

32
r/LocalLLaMA community 1d ago

High-quality GLM-5.2 Quant on 4x DGX Spark - Guide, Results, and Comps

I got GLM-5.2 NVFP4 running on four DGX Sparks at 128K context. This is still a niche/hacky setup, but it is now a real serving point rather than just a proof of life. Objective : A high quality 4-bit quant running on 4x spark. Model: https://huggingface.co/Mapika/GLM-5.2-NVFP4…

9
r/LocalLLaMA community 1d ago

MLX Fine-Tune Example Guide

A Local MLX Fine-Tuning Experiment: Just finished a local LoRA fine-tune of a 7B instruction model on Apple Silicon, via MLX, teaching it a high-fantasy literary register (Gene Wolfe and Tolkien). This is a more rigorous version with more data of something I tried two years ago…

14
r/LocalLLaMA community 1d ago

I built an agent Harness for Small Models. I got Qwen 3.5 4b managing servers.

This is something I've been working on. I like playing around with smaller local models but found most agent harnesses not well suited for them. The failure modes across different model families tend to be the same: failed tool calls; poor verification of environment variables; poor…

12
r/LocalLLaMA community 1d ago

Locally running model turns an Image into a Cute Controllable Character you can Play as

This is a sequel to my last post here!! It meant a lot to have such positive feedback last time. This is the 800M version of the previous model. It still has a LOT of issues but the promise is the same. Working comfortably on consumer GPUs. The context is increased to 12 latent…

32
r/LocalLLaMA community 1d ago

Qwen3.6-27B UD Q3 with kv at q8 is quite amazing for simple proof of concepts

Preface: technology is not my industry, but I am a very passionate poor man. So much so that I discovered 'AI' - ChatGPT in the beginning of 2025. So go easy on me, I only try. I kind of understand MOE vs. Dense models, MOEs are much more forgiving when it comes to running as there…

22
r/LocalLLaMA community 1d ago

NPC Engine Using Local Models

I’ve been working on a game-agnostic NPC engine/backend based pretty heavily on SillyTavern-style architecture, and with smaller local models getting better and better, I honestly think this kind of thing could be the future of RPGs. Right now I’m using NVIDIA Parakeet 0.6 for…

22
r/LocalLLaMA community 1d ago

Tensor split performance on low-bandwidth (TB3) eGPUs, and a question

Hey everyone! I've got a pair of Morefine G1 4090M 16gb eGPUs connected at 40Gbps via TB3 (daisy-chained). I normally run them in layer split mode as it doesn't seem to need much bandwidth; I'm seeing around 1300t/s PP and 26t/s TG (35-40 with MTP), qwen3.6-27B @ Q4. Which is…

20
r/LocalLLaMA community 1d ago

Trying to understand why so many trash fine-tuned models on HuggingFace ...

The majority of these models do not perform even as well as the base model, not even worth wasting the disk space on HuggingFace server, Qwhoppass-27B-Mother-Ultimate-Lord, whatever... Seeing their proliferation and the booming AI job market, I think many of those are just for…

14
r/LocalLLaMA community 1d ago

Success story with MiMo-V2.5-GGUF:UD-Q5_K_XL

I don't see many stories about this model, but after several attempts (after I finally finished reconfiguring my cluster) I did something useful with it: it wrote a built-in llama.cpp tool for executing C++ code and using the results. Here's an exercise that MiMo V2.5 gave me to…

27
r/LocalLLaMA community 1d ago

Ornith-1.0-35B GGUF update: native MTP speculative-decode graft + full serving/TTFT/long-context numbers (llama.cpp, tp=1)

Follow-up to my previous Ornith-1.0-35B Q3_K_M post. I grafted a native MTP draft head onto the IQ4_XS body (head at Q6) for self-speculative decode, single GPU, llama.cpp: 1.3-1.35x single-stream decode (172.6 -> 233.8 tok/s). Next-token distribution is byte-identical to…

11
r/LocalLLaMA community 1d ago

China Has Matched Anthropic in Cybersecurity, Resetting AI Race

submitted by /u/pscoutou

30
r/LocalLLaMA community 1d ago

A lot of good M5 Max options available at Apple Refurbished

Just a heads-up. After Apple's price hike announcement, they added a bunch of top-of-the-line 14" M5 Pro/Max options to their refurbished website. If you got discouraged by the price hike, check out their refurbished store. submitted by /u/Hanthunius …

13
r/LocalLLaMA community 1d ago

The number 1 public enemy of open-source.

Dario's args: "Opensource you can see the source, here you cannot see inside the model" - yes you can, that's literally the open weights part btw. - I cannot see the weights inside Claude, but I can with GLM 5.2 - Models like Nemotron3 Ultra go further, all the data, training scripts,…

25
r/LocalLLaMA community 1d ago

Script to monitor llama cpp and analyze memory usage

My goal has always been to be productive with commodity hardware. So far my workhorses have been the MoE editions of Gemma 4 and Qwen 3.6 on an old desktop with a single 9060XT with 16GB RAM. The problem has always been that every source is vague about VRAM/RAM requirements.…

33
r/LocalLLaMA community 1d ago

Hypothetically speaking...

Would it not be possible to create crowd sourced, truly open sourced distilled LLMs with a simple wrapper around command line based AI services that exist today? I'm imagining a layer that goes around whatever application people currently use for coding/AI boyfriend that…

24
r/LocalLLaMA community 1d ago

DeepSpec - a deepseek-ai Collection

DeepSpec is a full-stack codebase for training and evaluating draft models for speculative decoding. It contains data preparation utilities, draft model implementations, training code, and evaluation scripts. Released checkpoints: the checkpoints below are the ones used…

26
r/LocalLLaMA community 1d ago

DFlash support merged into llama.cpp

submitted by /u/sammcj

36
