r/LocalLLaMA
42 articles archived
r/LocalLLaMA community 4h ago
Side Projects.
Little something I put together for playing with contexts larger than my 9070xt can handle: 8700k, dual P100s, 16GB DDR4, 32GB Optane, Samsung SATA SSD. Nothing too fancy. Anyone else do a recent build? How's it working out? submitted by /u/apollo_mg
15
r/LocalLLaMA community 6h ago
Efficient pretraining with token superposition by Nous Research
submitted by /u/de4dee
14
r/LocalLLaMA community 7h ago
Sipeed's K3 RISC-V SBC can run 30B-parameter LLMs at 60 TOPS (INT4); supports BF16/FP16/INT4
https://wccftech.com/sipeed-crams-32gb-lpddr5-60-tops-npu-compact-risc-v-board-hits-15-tokens-s-ai-llms/ submitted by /u/MundanePercentage674
19
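The ~15 tokens/s claim in the headline is consistent with a back-of-envelope memory-bandwidth estimate: decode speed is roughly memory bandwidth divided by the bytes of weights read per token. A minimal sketch, where the ~50 GB/s LPDDR5 bandwidth and the active-parameter counts are assumptions for illustration, not figures from the article:

```python
def est_decode_tps(bandwidth_gbs: float, active_params_b: float, bytes_per_param: float) -> float:
    """Rough decode speed: each generated token reads the active weights once."""
    active_gb = active_params_b * bytes_per_param
    return bandwidth_gbs / active_gb

# A dense 30B model at INT4 (~0.5 byte/param) over an assumed ~50 GB/s bus:
dense = est_decode_tps(50, 30, 0.5)   # ~3.3 tok/s
# A 30B MoE with ~3B active params lands near the headline figure:
moe = est_decode_tps(50, 3, 0.5)      # ~33 tok/s
```

The gap between the two estimates is why MoE models are the usual way a board like this reaches double-digit tokens/s on a "30B" model.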
r/LocalLLaMA community 9h ago
qwen3.6 just stops
[screenshot] Sometimes Qwen 3.6 just stops in the middle of a task; is there a way to avoid it? This is qwen-code CLI, but it also happens on opencode. Running with vLLM with…
17
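When a model served through an OpenAI-compatible endpoint like vLLM halts mid-task, a common cause is hitting the completion token limit rather than the model emitting a stop token, and the response's `finish_reason` field distinguishes the two. A minimal sketch, assuming the standard chat-completions response shape (the dicts below are illustrative, not real server output):

```python
def hit_token_limit(response: dict) -> bool:
    """True when generation was cut off by the max_tokens budget
    ("length") rather than finishing naturally ("stop")."""
    return response["choices"][0]["finish_reason"] == "length"

truncated = {"choices": [{"finish_reason": "length", "message": {"content": "..."}}]}
finished = {"choices": [{"finish_reason": "stop", "message": {"content": "done"}}]}

assert hit_token_limit(truncated)
assert not hit_token_limit(finished)
```

If this check fires, the usual first fixes are raising `max_tokens` in the request and making sure the server's context window (vLLM's `--max-model-len`) is large enough for agentic sessions.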
r/LocalLLaMA community 15h ago
Does THINKING MODE significantly improve translation?
For a solid model from Qwen or Gemma 4: when translating a text, does "thinking mode" significantly boost the quality of the translation, or is the difference negligible? submitted by /u/Sostrene_Blue
27
r/LocalLLaMA community 21h ago
AntAngelMed - 100a6b Healthcare LLM
submitted by /u/Zc5Gwu
38
r/LocalLLaMA community 1d ago
Dad, why is my sister's name Lora?
submitted by /u/rwitz4
35
r/LocalLLaMA community 1d ago
examples : add llama-eval by ggerganov · Pull Request #21152 · ggml-org/llama.cpp
Now you can evaluate your models at home; sounds like a perfect tool to compare quants and finetunes. Datasets: AIME, AIME2025, GSM8K, GPQA. submitted by /u/jacek2023
15