Tag: Inference · 340 articles archived under #inference
Simon Willison community 1mo ago How fast is 10 tokens per second really? Neat little HTML app by Mike Veerman (source code here) which simulates LLM token output speeds from 5/second to 800/second. Useful if you see a model advertised as "30 tokens/second" and want to get a feel for what that actually looks… 4 r/LocalLLaMA community 1mo ago RTX 5080 16GB: Qwen3.6 35B MoE at 128k context — 56 tok/s, and why MTP doesn't help MTP (Multi-Token Prediction) just merged into mainline llama.cpp at b9190. I promised u/WarthogConfident4039 a Qwen3.6 benchmarking round. Three configs, tested at real coding-agent context lengths (not just 512 tokens). The main finding surprised me. TL;DR: 35B Q4_K_XL, no MTP,… 38 arXiv — Machine Learning research 1mo ago Not All Tokens Are Worth Caching: Learning Semantic-Aware Eviction for LLM Prefix Caches arXiv:2605.18825v1 Announce Type: new Abstract: Prefix caching is a key optimization in Large Language Model (LLM) serving, reusing attention Key-Value (KV) states across requests with shared prompt prefixes to reduce expensive prefill computation. However, its benefit depends… 5 arXiv — Machine Learning research 1mo ago Towards Family-Grouped Hierarchical Federated Learning on Sub-5KB Models: A Feasibility Study of Privacy-Preserving ECG Monitoring for Ultra-Resource-Constrained Wearables arXiv:2605.18862v1 Announce Type: new Abstract: Cardiovascular disease remains the leading cause of death worldwide, and early detection of arrhythmias through continuous ECG monitoring on wearable devices can prevent life-threatening events.
Federated Learning (FL) enables… 26 arXiv — Machine Learning research 1mo ago Don't Let Bandit Feedback Pull Continual LLM-Recommender Updates Off Target arXiv:2605.18899v1 Announce Type: new Abstract: Generative LLM-based recommenders (LLM-Rec) require continual post-deployment updates, yet deployment logs provide only policy-shaped contextual bandit feedback: outcomes are observed solely for items exposed by a prior serving… 25 arXiv — Machine Learning research 1mo ago KVBuffer: IO-aware Serving for Linear Attention arXiv:2605.19049v1 Announce Type: new Abstract: Linear attention has recently gained significant attention for long-context inference due to its constant decoding cost with respect to context length. However, existing serving systems typically serve linear attention by… 28 arXiv — NLP / Computation & Language research 1mo ago Mathematical Reasoning in Large Language Models: Benchmarks, Architectures, Evaluation, and Open Challenges arXiv:2605.19723v1 Announce Type: new Abstract: Mathematical reasoning is essential for problem-solving in education, science, and industry, serving as a crucial benchmark for evaluating artificial intelligence systems. As Large Language Models (LLMs) improve their reasoning… 19 r/LocalLLaMA community 1mo ago Running DeepSeek-V4 locally with 4x legacy RTX 2080 Ti ($2k budget setup). Custom Turing kernels, W8A8 quantization, and 255 prefill tok/s! Hey r/DeepSeek , Who says we need an H100 cluster or the latest expensive GPUs to run frontier MoE models? 
I wanted to see how far we could push a single node of consumer legacy hardware, so we spent less than $2,500 total to build a budget machine that successfully runs… 29 arXiv — Machine Learning research 1mo ago Byzantine-Resilient Federated Learning via QUBO-Based Client Selection on Quantum Annealers arXiv:2605.16438v1 Announce Type: new Abstract: Federated Learning (FL) trains a global model across decentralized clients while preserving data privacy, but at scale it is vulnerable to malicious updates. Byzantine-resilient aggregation methods such as MultiKrum score gradients… 23 arXiv — Machine Learning research 1mo ago Wavelet Flow Matching for Multi-Scale Physics Emulation arXiv:2605.16573v1 Announce Type: new Abstract: Accurate emulation of multi-scale physical systems governed by PDEs demands models that remain stable over long autoregressive rollouts while preserving fine-scale structures. Deterministic emulators produce overly-smoothed… 5 arXiv — NLP / Computation & Language research 1mo ago CompactAttention: Accelerating Chunked Prefill with Block-Union KV Selection arXiv:2605.16839v1 Announce Type: new Abstract: Chunked prefill has become a widely adopted serving strategy for long-context large language models, but efficient attention computation in this regime remains challenging. Existing sparse attention methods are primarily designed… 31 arXiv — NLP / Computation & Language research 1mo ago E-PMQ: Expert-Guided Post-Merge Quantization with Merged-Weight Anchoring arXiv:2605.16882v1 Announce Type: new Abstract: Low-resource deployment constraints have made model quantization essential for deploying neural networks while preserving performance. 
Meanwhile, model merging has become an increasingly practical low-resource strategy for… 4 arXiv — NLP / Computation & Language research 1mo ago Stop When Reasoning Converges: Semantic-Preserving Early Exit for Reasoning Models arXiv:2605.17672v1 Announce Type: new Abstract: Large Reasoning Models (LRMs) achieve strong performance by generating long chains of thought (CoT), but often overthink, continuing to reason after a solution has already stabilized and thereby wasting tokens and increasing… 8 r/MachineLearning community 1mo ago Rewriting model inference with CUDA kernels: the bottleneck was not just GEMM [P] I’ve been working on a CUDA-first inference runtime for small-batch / realtime ML workloads. The core idea is simple: instead of treating PyTorch / TensorRT / generic graph runtimes as the main execution path, I rewrite the model inference path directly with C++/CUDA kernels.… 13 r/LocalLLaMA community 1mo ago llama.cpp MTP support landed - Qwen3.6 27B at 2.44× on a Strix Halo, 2.17× on a RTX 3090 rig PR #22673 (commit 4f13cb7) landed MTP speculative decoding in mainline llama.cpp on May 16. I tested it on two separate rigs. Qwen3.6 27B, single-stream chat, temperature 0, median of 5 runs: Strix Halo (Framework Desktop, ROCm 7.0.2): Q4_K_M: 11.7 → 21.2 tok/s (1.81×) Q8_0: 7.4… 31 r/LocalLLaMA community 1mo ago Configuration Qwen3.6-35b-a3b (12Gb VRAM) Has anyone here tested different KV cache quantizations and compared their performance? I’m currently using the model in Q5_K_M with Q4 KV cache on a 12 GB VRAM GPU. 
With this setup, I’m offloading about 27 MoE layers to the CPU and getting around 40 tok/s with a 128k total… 38 r/LocalLLaMA community 1mo ago Qwen 3.6 27B on 24GB VRAM setup: backend comparisons, quant choice and settings (llama.cpp, ik_llama.cpp, BeeLlama, vllm) TL;DR best setup I tested on a RTX 3090 24 GB: ik_llama.cpp + Qwen3.6-27B-MTP-IQ4_KS.gguf 156k context, q8_0/q8_0 KV, MTP, vision on CPU benchmark result on a ~5.9k prompt + 1k output: about 1261 tok/s prefill, 72.9 tok/s decode llama.cpp was a good start, BeeLlama worth… 17 Hugging Face Daily Papers research 1mo ago PhysBrain 1.0 Technical Report Abstract PhysBrain 1.0 leverages human egocentric video to generate physical commonsense supervision for vision-language-action models, achieving state-of-the-art performance in embodied control tasks through capability-preserving adaptation. AI-generated summary… 28 arXiv — Machine Learning research 1mo ago LPDS: Evaluating LLM Robustness Through Logic-Preserving Difficulty Scaling arXiv:2605.15393v1 Announce Type: new Abstract: As large language models (LLMs) are increasingly deployed to perform tasks with minimal human oversight, it is crucial that these models operate robustly. In particular, a model that can solve a given problem should not fail simply… 11 arXiv — NLP / Computation & Language research 1mo ago ForMaT: Dataset for Visually-Grounded Multilingual PDF Translation arXiv:2605.15794v1 Announce Type: new Abstract: We present ForMaT (Format-Preserving Multilingual Translation), a parallel corpus of 3,956 PDFs across 15 language pairs that preserves original layout metadata proposed for multimodal machine translation. To ensure structural… 19 Hacker News — AI on Front Page community 1mo ago How fast is N tokens per second really? 
Article URL: https://mikeveerman.github.io/tokenspeed/ Comments URL: https://news.ycombinator.com/item?id=48174920 Points: 200 # Comments: 52 21 r/LocalLLaMA community 1mo ago Benchmarking vLLM vs SGLang vs llama.cpp on a mixed Blackwell/Ada cluster I have been running some benchmarks on a heterogeneous 7-GPU cluster to see how different inference engines handle long context prefill using pipeline parallelism. My setup consists of a mix of Blackwell and Ada cards: one RTX PRO 6000 96GB, one PRO 5000 48GB, two 5090 32GB, and… 4 r/LocalLLaMA community 1mo ago MiroThinker-1.7, an open-weight deep research agent (Qwen3 MoE base) — mini is 30B/3B active, curious what tok/s people get on consumer hardware As usual, disclosure first: I'm on the team that built this. Our MiroThinker-1.7-deepresearch and 1.7-mini-deepresearch API went live, mini is a deep research agent built on Qwen3 MoE (30B total, 3B active for mini). Weights on HuggingFace:… 14 r/LocalLLaMA community 1mo ago Using Intel Arc Pro series, any thoughts ? Simple question: Has anyone run two or more of either of these on Ubuntu ? Intel Arc Pro B70 (32 GB) Intel Arc Pro B65 (32 GB) Running llama or vLLM etc., Any thoughts submitted by /u/BikerBoyRoy123 13 r/LocalLLaMA community 1mo ago Can a 5090 with qwen3.6 achieve > 3,000 tok/s ? bring your pitchforks (open-dllm) so background - these people. Fred Zhangzhi Peng, Shuibai Zhang, Alex Tong, worked on converting AR -> diffusion (its already working from older models). https://oval-shell-31c.notion.site/Open-dLLM-Open-Diffusion-Large-Language-Model-25e03bf6136480b7a4ebe3d53be9f68a I forked… 23 Hugging Face Daily Papers research 1mo ago Aligning Latent Geometry for Spherical Flow Matching in Image Generation Abstract Geodesic flow matching improves image generation by projecting latents onto fixed radius spheres and using spherical linear interpolation instead of linear paths, preserving semantic content through angular components.
AI-generated summary Latent flow matching for image… 26 r/LocalLLaMA community 1mo ago is there a centralized website for llm launch commands? I keep on finding myself scrounging wikis and whatnot for everyone's serving commands, is there a site where users could contribute their commands, hardware, runtime and whatnot? submitted by /u/onephn 33 r/LocalLLaMA community 1mo ago Built a fully offline suitcase robot around a Jetson Orin NX SUPER 16GB. Gemma 4 E4B, ~200ms cached TTFT, 30+ sensors, no WiFi/BT/cellular. He has opinions. Sparky runs entirely on the Jetson. Gemma 4 E4B at Q4_K_M via llama.cpp with q8_0 KV cache and flash attention. 12K context, native system role, sampler defaults from the model card. Cached TTFT around 200ms, sustained 14-15 tok/s. SenseVoiceSmall for STT, Piper for TTS with… 21 r/LocalLLaMA community 1mo ago Important (vision) Qwen3.5 template fix dropped in vllm Sharing this because I personally had some annoying issues and I can confirm this un-fucked them. Basically once you posted an image in the conversation the model went haywire. Not too badly but annoying submitted by /u/Dany0 14 r/LocalLLaMA community 1mo ago Used over a million tokens in three separate sessions to test Qwen 3.6 35b (new Multi-token Prediction version) In my opinion, MTP models are 100% game changer for local LLMs. In terms of speed, I was getting around 1.5x the tok/sec of previous tests. The project was a test - building a full iterative step-by-step pygame; a small mystery dungeon-style game.
At first I set 100-200k context… 28 arXiv — Machine Learning research 1mo ago PreFT: Prefill-only finetuning for efficient inference arXiv:2605.14217v1 Announce Type: new Abstract: Large language models can now be personalised efficiently at scale using parameter efficient finetuning methods (PEFTs), but serving user-specific PEFTs harms throughput, even with specialised kernels and memory management… 32 arXiv — Machine Learning research 1mo ago MetaMoE: Diversity-Aware Proxy Selection for Privacy-Preserving Mixture-of-Experts Unification arXiv:2605.14289v1 Announce Type: new Abstract: Mixture-of-Experts (MoE) models scale capacity by combining specialized experts, but most existing approaches assume centralized access to training data. In practice, data are distributed across clients and cannot be shared due to… 36 arXiv — Machine Learning research 1mo ago MoRe: Modular Representations for Principled Continual Representation Learning on Squantial Data arXiv:2605.14364v1 Announce Type: new Abstract: Continual learning requires models to adapt to new data while preserving previously acquired knowledge. At its core, this challenge can be viewed as principled one-step adaptation: incorporating new information with minimal… 13 arXiv — NLP / Computation & Language research 1mo ago GradShield: Alignment Preserving Finetuning arXiv:2605.14194v1 Announce Type: new Abstract: Large Language Models (LLMs) pose a significant risk of safety misalignment after finetuning, as models can be compromised by both explicitly and implicitly harmful data. Even some seemingly benign data can inadvertently steer a… 23 Hugging Face Daily Papers research 1mo ago Topology-Preserving Neural Operator Learning via Hodge Decomposition Abstract Physical field equations on geometric meshes are analyzed through Hodge theory to develop a hybrid Eulerian-Lagrangian architecture that improves accuracy and efficiency by separating topological and geometric components. 
AI-generated summary In this paper, we study… 29 Vercel — AI dev-tools 1mo ago Sort providers by cost, latency, or throughput on AI Gateway You can now sort the providers behind a model by cost, time to first token (TTFT), or throughput (TPS) in AI Gateway . The default provider order blends provider reliability, quality of model output, cost, and speed of response. You can now use sort for explicit control over… 35 vLLM releases dev-tools 1mo ago v0.21.0 Highlights This release features 367 commits from 202 contributors (49 new)! Transformers v4 deprecated : This release formally deprecates transformers v4 support ( #40389 ). Users should migrate to transformers v5. C++20 build requirement : vLLM now requires a C++20-compatible… 23 r/LocalLLaMA community 1mo ago A First Comprehensive Study of TurboQuant: Accuracy and Performance TL;DR from the article: FP8 via --kv-cache-dtype fp8 remains the best default for KV-cache quantization: it provides 2x KV-cache capacity with negligible accuracy loss, while matching BF16 on most performance metrics and substantially improving them in memory-constrained serving… 27 r/LocalLLaMA community 1mo ago Is there a big gap between Q4 and Q6 on Qwen3.6? I’ve got one 3090 and thanks to the help of MTP and all, I can do around 65 tok/s on qwen 3.6 dense 27b. But I’m running at Q4_M so everything fits and my context isn’t super high. Maybe 65k or up to 100k. I’ve thrown around the idea of a second 3090. But I do already have some… 28 arXiv — Machine Learning research 1mo ago Inference-Time Machine Unlearning via Gated Activation Redirection arXiv:2605.12765v1 Announce Type: new Abstract: Large Language Models memorize vast amounts of training data, raising concerns regarding privacy, copyright infringement, and safety. 
Machine unlearning seeks to remove the influence of a targeted forget set while preserving model… 10 arXiv — Machine Learning research 1mo ago Rethinking Efficient Graph Coarsening via a Non-Selfishness Principle arXiv:2605.13021v1 Announce Type: new Abstract: Graph coarsening is a graph dimensionality reduction technique that aims to construct a smaller and more tractable graph while preserving the essential structural and semantic properties of the original graph. However, most… 28 Hugging Face Daily Papers research 1mo ago MinT: Managed Infrastructure for Training and Serving Millions of LLMs Abstract MinT is a managed infrastructure system that enables efficient low-rank adaptation training and serving by keeping base models resident and moving lightweight adapter revisions, scaling across multiple dimensions including large model architectures, reduced storage… 28 llama.cpp releases dev-tools 1mo ago b9141 server, webui: accept continue_final_message flag for vLLM API compat ( #23012 ) server, webui: accept continue_final_message flag for vLLM API compat Add the continue_final_message body flag from the vLLM and transformers API. When set together with add_generation_prompt false,… 11 r/LocalLLaMA community 1mo ago 24+ tok/s from ~30B MoE models on an old GTX 1080 (8 GB VRAM, 128k context) I got Qwen 3.6 35B-A3B and Gemma 4 26B-A4B running on a $200 secondhand machine (i7-6700 / GTX 1080 / 32 GB RAM) using llama.cpp (the TurboQuant/RotorQuant KV cache quantisation allows 128k context within the 8 GB VRAM). Results (Q4_K_M models, 128k context): Model tok/s Key… 19 Hugging Face Daily Papers research 1mo ago ORBIT: Preserving Foundational Language Capabilities in GenRetrieval via Origin-Regulated Merging Abstract ORBIT addresses catastrophic forgetting in large language model fine-tuning for generative retrieval by tracking parameter distances and employing weight averaging to maintain model performance. 
AI-generated summary Despite the rapid advancements in large language model… 7 r/LocalLLaMA community 1mo ago qwen3.6 just stops https://preview.redd.it/74cj1xu9pw0h1.png?width=1229&format=png&auto=webp&s=3ae999cc3530ecb4eccf70e25f1a9eb2aa3f2d7b Sometimes qwen 3.6 just stops at the middle of a task, is there a way to avoid it? This is qwen-code CLI, but also happens on opencode. Running with vLLM with… 17 Hugging Face Daily Papers research 1mo ago Pion: A Spectrum-Preserving Optimizer via Orthogonal Equivalence Transformation Abstract Pion is a spectrum-preserving optimizer for large language model training that uses orthogonal equivalence transformations to maintain singular values during weight updates, offering stable performance comparable to standard optimizers. AI-generated summary We introduce… 34 Hugging Face Daily Papers research 1mo ago FaithfulFaces: Pose-Faithful Facial Identity Preservation for Text-to-Video Generation Abstract FaithfulFaces is a pose-faithful facial identity preservation framework that improves identity consistency in text-to-video generation through pose-shared alignment and explicit Euler angle embeddings. AI-generated summary Identity-preserving text-to-video generation… 38 arXiv — Machine Learning research 1mo ago Rotation-Preserving Supervised Fine-Tuning arXiv:2605.10973v1 Announce Type: new Abstract: Supervised fine-tuning (SFT) improves in-domain performance but can degrade out-of-domain (OOD) generalization. Prior work suggests that this degradation is related to changes in dominant singular subspaces of pretrained weight… 22 arXiv — Machine Learning research 1mo ago Behavioral Mode Discovery for Fine-tuning Multimodal Generative Policies arXiv:2605.11387v1 Announce Type: new Abstract: We address the problem of fine-tuning pre-trained generative policies with reinforcement learning (RL) while preserving the multimodality of their action distributions. 
Existing methods for RL fine-tuning of generative policies… 17