Tag

Inference

340 articles archived under #inference

Simon Willison community 1mo ago

How fast is 10 tokens per second really?

Neat little HTML app by Mike Veerman (source code here) which simulates LLM token output speeds from 5/second to 800/second. Useful if you see a model advertised as "30 tokens/second" and want to get a feel for what that actually looks…
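To put a number on what an advertised rate means in practice, here is a minimal sketch that converts a decode rate into wall-clock time. It is not taken from the linked app: the function name `seconds_to_generate` and the 500-token response length are assumptions chosen for illustration, and the model assumes a steady rate with no prefill delay.

```python
def seconds_to_generate(num_tokens: int, tokens_per_second: float) -> float:
    """Wall-clock seconds to stream num_tokens at a steady decode rate.

    Hypothetical helper for illustration; ignores time to first token.
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


if __name__ == "__main__":
    # Assumed 500-token answer, at rates inside the 5 to 800 tok/s range the app covers.
    for rate in (5, 10, 30, 800):
        print(f"{rate:>3} tok/s -> {seconds_to_generate(500, rate):.1f} s")
```

Under these assumptions, the same answer that takes under a second at 800 tok/s takes close to a minute at 10 tok/s, which is the gap the simulator lets you feel directly.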

4
r/LocalLLaMA community 1mo ago

RTX 5080 16GB: Qwen3.6 35B MoE at 128k context — 56 tok/s, and why MTP doesn't help

MTP (Multi-Token Prediction) just merged into mainline llama.cpp at b9190. I promised u/WarthogConfident4039 a Qwen3.6 benchmarking round. Three configs, tested at real coding-agent context lengths (not just 512 tokens). The main finding surprised me. TL;DR: 35B Q4_K_XL, no MTP,…

38
arXiv — Machine Learning research 1mo ago

Not All Tokens Are Worth Caching: Learning Semantic-Aware Eviction for LLM Prefix Caches

arXiv:2605.18825v1 Announce Type: new Abstract: Prefix caching is a key optimization in Large Language Model (LLM) serving, reusing attention Key-Value (KV) states across requests with shared prompt prefixes to reduce expensive prefill computation. However, its benefit depends…

5
arXiv — Machine Learning research 1mo ago

Towards Family-Grouped Hierarchical Federated Learning on Sub-5KB Models: A Feasibility Study of Privacy-Preserving ECG Monitoring for Ultra-Resource-Constrained Wearables

arXiv:2605.18862v1 Announce Type: new Abstract: Cardiovascular disease remains the leading cause of death worldwide, and early detection of arrhythmias through continuous ECG monitoring on wearable devices can prevent life-threatening events. Federated Learning (FL) enables…

26
arXiv — Machine Learning research 1mo ago

Don't Let Bandit Feedback Pull Continual LLM-Recommender Updates Off Target

arXiv:2605.18899v1 Announce Type: new Abstract: Generative LLM-based recommenders (LLM-Rec) require continual post-deployment updates, yet deployment logs provide only policy-shaped contextual bandit feedback: outcomes are observed solely for items exposed by a prior serving…

25
arXiv — Machine Learning research 1mo ago

KVBuffer: IO-aware Serving for Linear Attention

arXiv:2605.19049v1 Announce Type: new Abstract: Linear attention has recently gained significant attention for long-context inference due to its constant decoding cost with respect to context length. However, existing serving systems typically serve linear attention by…

28
arXiv — NLP / Computation & Language research 1mo ago

Mathematical Reasoning in Large Language Models: Benchmarks, Architectures, Evaluation, and Open Challenges

arXiv:2605.19723v1 Announce Type: new Abstract: Mathematical reasoning is essential for problem-solving in education, science, and industry, serving as a crucial benchmark for evaluating artificial intelligence systems. As Large Language Models (LLMs) improve their reasoning…

19
r/LocalLLaMA community 1mo ago

Running DeepSeek-V4 locally with 4x legacy RTX 2080 Ti ($2k budget setup). Custom Turing kernels, W8A8 quantization, and 255 prefill tok/s!

Hey r/DeepSeek, who says we need an H100 cluster or the latest expensive GPUs to run frontier MoE models? I wanted to see how far we could push a single node of consumer legacy hardware, so we spent less than $2,500 total to build a budget machine that successfully runs…

29
arXiv — Machine Learning research 1mo ago

Byzantine-Resilient Federated Learning via QUBO-Based Client Selection on Quantum Annealers

arXiv:2605.16438v1 Announce Type: new Abstract: Federated Learning (FL) trains a global model across decentralized clients while preserving data privacy, but at scale it is vulnerable to malicious updates. Byzantine-resilient aggregation methods such as MultiKrum score gradients…

23
arXiv — Machine Learning research 1mo ago

Wavelet Flow Matching for Multi-Scale Physics Emulation

arXiv:2605.16573v1 Announce Type: new Abstract: Accurate emulation of multi-scale physical systems governed by PDEs demands models that remain stable over long autoregressive rollouts while preserving fine-scale structures. Deterministic emulators produce overly-smoothed…

5
arXiv — NLP / Computation & Language research 1mo ago

CompactAttention: Accelerating Chunked Prefill with Block-Union KV Selection

arXiv:2605.16839v1 Announce Type: new Abstract: Chunked prefill has become a widely adopted serving strategy for long-context large language models, but efficient attention computation in this regime remains challenging. Existing sparse attention methods are primarily designed…

31
arXiv — NLP / Computation & Language research 1mo ago

E-PMQ: Expert-Guided Post-Merge Quantization with Merged-Weight Anchoring

arXiv:2605.16882v1 Announce Type: new Abstract: Low-resource deployment constraints have made model quantization essential for deploying neural networks while preserving performance. Meanwhile, model merging has become an increasingly practical low-resource strategy for…

4
arXiv — NLP / Computation & Language research 1mo ago

Stop When Reasoning Converges: Semantic-Preserving Early Exit for Reasoning Models

arXiv:2605.17672v1 Announce Type: new Abstract: Large Reasoning Models (LRMs) achieve strong performance by generating long chains of thought (CoT), but often overthink, continuing to reason after a solution has already stabilized and thereby wasting tokens and increasing…

8
r/MachineLearning community 1mo ago

Rewriting model inference with CUDA kernels: the bottleneck was not just GEMM [P]

I’ve been working on a CUDA-first inference runtime for small-batch / realtime ML workloads. The core idea is simple: instead of treating PyTorch / TensorRT / generic graph runtimes as the main execution path, I rewrite the model inference path directly with C++/CUDA kernels.…

13
r/LocalLLaMA community 1mo ago

llama.cpp MTP support landed - Qwen3.6 27B at 2.44× on a Strix Halo, 2.17× on a RTX 3090 rig

PR #22673 (commit 4f13cb7) landed MTP speculative decoding in mainline llama.cpp on May 16. I tested it on two separate rigs. Qwen3.6 27B, single-stream chat, temperature 0, median of 5 runs: Strix Halo (Framework Desktop, ROCm 7.0.2): Q4_K_M: 11.7 → 21.2 tok/s (1.81×) Q8_0: 7.4…

31
r/LocalLLaMA community 1mo ago

Configuration Qwen3.6-35b-a3b (12Gb VRAM)

Has anyone here tested different KV cache quantizations and compared their performance? I’m currently using the model in Q5_K_M with Q4 KV cache on a 12 GB VRAM GPU. With this setup, I’m offloading about 27 MoE layers to the CPU and getting around 40 tok/s with a 128k total…

38
r/LocalLLaMA community 1mo ago

Qwen 3.6 27B on 24GB VRAM setup: backend comparisons, quant choice and settings (llama.cpp, ik_llama.cpp, BeeLlama, vllm)

TL;DR best setup I tested on a RTX 3090 24 GB: ik_llama.cpp + Qwen3.6-27B-MTP-IQ4_KS.gguf 156k context, q8_0/q8_0 KV, MTP, vision on CPU benchmark result on a ~5.9k prompt + 1k output: about 1261 tok/s prefill, 72.9 tok/s decode llama.cpp was a good start, BeeLlama worth…

17
Hugging Face Daily Papers research 1mo ago

PhysBrain 1.0 Technical Report

Abstract PhysBrain 1.0 leverages human egocentric video to generate physical commonsense supervision for vision-language-action models, achieving state-of-the-art performance in embodied control tasks through capability-preserving adaptation. AI-generated summary…

28
arXiv — Machine Learning research 1mo ago

LPDS: Evaluating LLM Robustness Through Logic-Preserving Difficulty Scaling

arXiv:2605.15393v1 Announce Type: new Abstract: As large language models (LLMs) are increasingly deployed to perform tasks with minimal human oversight, it is crucial that these models operate robustly. In particular, a model that can solve a given problem should not fail simply…

11
arXiv — NLP / Computation & Language research 1mo ago

ForMaT: Dataset for Visually-Grounded Multilingual PDF Translation

arXiv:2605.15794v1 Announce Type: new Abstract: We present ForMaT (Format-Preserving Multilingual Translation), a parallel corpus of 3,956 PDFs across 15 language pairs that preserves original layout metadata proposed for multimodal machine translation. To ensure structural…

19
Hacker News — AI on Front Page community 1mo ago

How fast is N tokens per second really?

Article URL: https://mikeveerman.github.io/tokenspeed/ Comments URL: https://news.ycombinator.com/item?id=48174920 Points: 200 # Comments: 52

21
r/LocalLLaMA community 1mo ago

Benchmarking vLLM vs SGLang vs llama.cpp on a mixed Blackwell/Ada cluster

I have been running some benchmarks on a heterogeneous 7-GPU cluster to see how different inference engines handle long context prefill using pipeline parallelism. My setup consists of a mix of Blackwell and Ada cards: one RTX PRO 6000 96GB, one PRO 5000 48GB, two 5090 32GB, and…

4
r/LocalLLaMA community 1mo ago

MiroThinker-1.7, an open-weight deep research agent (Qwen3 MoE base) — mini is 30B/3B active, curious what tok/s people get on consumer hardware

As usual, disclosure first: I'm on the team that built this. Our MiroThinker-1.7-deepresearch and 1.7-mini-deepresearch API went live, mini is a deep research agent built on Qwen3 MoE (30B total, 3B active for mini). Weights on HuggingFace:…

14
r/LocalLLaMA community 1mo ago

Using Intel Arc Pro series, any thoughts?

Simple question: Has anyone run two or more of either of these on Ubuntu? Intel Arc Pro B70 (32 GB), Intel Arc Pro B65 (32 GB). Running llama or vLLM etc. Any thoughts? Submitted by /u/BikerBoyRoy123

13
r/LocalLLaMA community 1mo ago

Can a 5090 with qwen3.6 achieve > 3,000 tok/s ? bring your pitchforks (open-dllm)

so background - these people. Fred Zhangzhi Peng, Shuibai Zhang, Alex Tong, worked on converting AR -> diffusion (its already working from older models). https://oval-shell-31c.notion.site/Open-dLLM-Open-Diffusion-Large-Language-Model-25e03bf6136480b7a4ebe3d53be9f68a I forked…

23
Hugging Face Daily Papers research 1mo ago

Aligning Latent Geometry for Spherical Flow Matching in Image Generation

Abstract Geodesic flow matching improves image generation by projecting latents onto fixed radius spheres and using spherical linear interpolation instead of linear paths, preserving semantic content through angular components. AI-generated summary Latent flow matching for image…

26
r/LocalLLaMA community 1mo ago

is there a centralized website for llm launch commands?

I keep on finding myself scrounging wikis and whatnot for everyone's serving commands, is there a site where users could contribute their commands, hardware, runtime and whatnot? Submitted by /u/onephn

33
r/LocalLLaMA community 1mo ago

Built a fully offline suitcase robot around a Jetson Orin NX SUPER 16GB. Gemma 4 E4B, ~200ms cached TTFT, 30+ sensors, no WiFi/BT/cellular. He has opinions.

Sparky runs entirely on the Jetson. Gemma 4 E4B at Q4_K_M via llama.cpp with q8_0 KV cache and flash attention. 12K context, native system role, sampler defaults from the model card. Cached TTFT around 200ms, sustained 14-15 tok/s. SenseVoiceSmall for STT, Piper for TTS with…

21
r/LocalLLaMA community 1mo ago

Important (vision) Qwen3.5 template fix dropped in vllm

Sharing this because I personally had some annoying issues and I can confirm this un-fucked them. Basically once you posted an image in the conversation the model went haywire. Not too badly, but annoying. Submitted by /u/Dany0

14
r/LocalLLaMA community 1mo ago

Used over a million tokens in three separate sessions to test Qwen 3.6 35b (new Multi-token Prediction version)

In my opinion, MTP models are 100% game changer for local LLMs. In terms of speed, I was getting around 1.5x the tok/sec of previous tests. The project was a test - building a full iterative step-by-step pygame; a small mystery dungeon-style game. At first I set 100-200k context…

28
arXiv — Machine Learning research 1mo ago

PreFT: Prefill-only finetuning for efficient inference

arXiv:2605.14217v1 Announce Type: new Abstract: Large language models can now be personalised efficiently at scale using parameter efficient finetuning methods (PEFTs), but serving user-specific PEFTs harms throughput, even with specialised kernels and memory management…

32
arXiv — Machine Learning research 1mo ago

MetaMoE: Diversity-Aware Proxy Selection for Privacy-Preserving Mixture-of-Experts Unification

arXiv:2605.14289v1 Announce Type: new Abstract: Mixture-of-Experts (MoE) models scale capacity by combining specialized experts, but most existing approaches assume centralized access to training data. In practice, data are distributed across clients and cannot be shared due to…

36
arXiv — Machine Learning research 1mo ago

MoRe: Modular Representations for Principled Continual Representation Learning on Sequential Data

arXiv:2605.14364v1 Announce Type: new Abstract: Continual learning requires models to adapt to new data while preserving previously acquired knowledge. At its core, this challenge can be viewed as principled one-step adaptation: incorporating new information with minimal…

13
arXiv — NLP / Computation & Language research 1mo ago

GradShield: Alignment Preserving Finetuning

arXiv:2605.14194v1 Announce Type: new Abstract: Large Language Models (LLMs) pose a significant risk of safety misalignment after finetuning, as models can be compromised by both explicitly and implicitly harmful data. Even some seemingly benign data can inadvertently steer a…

23
Hugging Face Daily Papers research 1mo ago

Topology-Preserving Neural Operator Learning via Hodge Decomposition

Abstract Physical field equations on geometric meshes are analyzed through Hodge theory to develop a hybrid Eulerian-Lagrangian architecture that improves accuracy and efficiency by separating topological and geometric components. AI-generated summary In this paper, we study…

29
Vercel — AI dev-tools 1mo ago

Sort providers by cost, latency, or throughput on AI Gateway

You can now sort the providers behind a model by cost, time to first token (TTFT), or throughput (TPS) in AI Gateway. The default provider order blends provider reliability, quality of model output, cost, and speed of response. You can now use sort for explicit control over…

35
vLLM releases dev-tools 1mo ago

v0.21.0

Highlights: This release features 367 commits from 202 contributors (49 new)! Transformers v4 deprecated: This release formally deprecates transformers v4 support (#40389). Users should migrate to transformers v5. C++20 build requirement: vLLM now requires a C++20-compatible…

23
r/LocalLLaMA community 1mo ago

A First Comprehensive Study of TurboQuant: Accuracy and Performance

TL;DR from the article: FP8 via --kv-cache-dtype fp8 remains the best default for KV-cache quantization: it provides 2x KV-cache capacity with negligible accuracy loss, while matching BF16 on most performance metrics and substantially improving them in memory-constrained serving…

27
r/LocalLLaMA community 1mo ago

Is there a big gap between Q4 and Q6 on Qwen3.6?

I’ve got one 3090 and thanks to the help of MTP and all, I can do around 65 tok/s on qwen 3.6 dense 27b. But I’m running at Q4_M so everything fits and my context isn’t super high. Maybe 65k or up to 100k. I’ve thrown around the idea of a second 3090. But I do already have some…

28
arXiv — Machine Learning research 1mo ago

Inference-Time Machine Unlearning via Gated Activation Redirection

arXiv:2605.12765v1 Announce Type: new Abstract: Large Language Models memorize vast amounts of training data, raising concerns regarding privacy, copyright infringement, and safety. Machine unlearning seeks to remove the influence of a targeted forget set while preserving model…

10
arXiv — Machine Learning research 1mo ago

Rethinking Efficient Graph Coarsening via a Non-Selfishness Principle

arXiv:2605.13021v1 Announce Type: new Abstract: Graph coarsening is a graph dimensionality reduction technique that aims to construct a smaller and more tractable graph while preserving the essential structural and semantic properties of the original graph. However, most…

28
Hugging Face Daily Papers research 1mo ago

MinT: Managed Infrastructure for Training and Serving Millions of LLMs

Abstract MinT is a managed infrastructure system that enables efficient low-rank adaptation training and serving by keeping base models resident and moving lightweight adapter revisions, scaling across multiple dimensions including large model architectures, reduced storage…

28
llama.cpp releases dev-tools 1mo ago

b9141

server, webui: accept continue_final_message flag for vLLM API compat (#23012). Add the continue_final_message body flag from the vLLM and transformers API. When set together with add_generation_prompt false,…

11
r/LocalLLaMA community 1mo ago

24+ tok/s from ~30B MoE models on an old GTX 1080 (8 GB VRAM, 128k context)

I got Qwen 3.6 35B-A3B and Gemma 4 26B-A4B running on a $200 secondhand machine (i7-6700 / GTX 1080 / 32 GB RAM) using llama.cpp (the TurboQuant/RotorQuant KV cache quantisation allows 128k context within the 8 GB VRAM). Results (Q4_K_M models, 128k context): Model tok/s Key…

19
Hugging Face Daily Papers research 1mo ago

ORBIT: Preserving Foundational Language Capabilities in GenRetrieval via Origin-Regulated Merging

Abstract ORBIT addresses catastrophic forgetting in large language model fine-tuning for generative retrieval by tracking parameter distances and employing weight averaging to maintain model performance. AI-generated summary Despite the rapid advancements in large language model…

7
r/LocalLLaMA community 1mo ago

qwen3.6 just stops

Sometimes qwen 3.6 just stops at the middle of a task, is there a way to avoid it? This is qwen-code CLI, but also happens on opencode. Running with vLLM with…

17
Hugging Face Daily Papers research 1mo ago

Pion: A Spectrum-Preserving Optimizer via Orthogonal Equivalence Transformation

Abstract Pion is a spectrum-preserving optimizer for large language model training that uses orthogonal equivalence transformations to maintain singular values during weight updates, offering stable performance comparable to standard optimizers. AI-generated summary We introduce…

34
Hugging Face Daily Papers research 1mo ago

FaithfulFaces: Pose-Faithful Facial Identity Preservation for Text-to-Video Generation

Abstract FaithfulFaces is a pose-faithful facial identity preservation framework that improves identity consistency in text-to-video generation through pose-shared alignment and explicit Euler angle embeddings. AI-generated summary Identity-preserving text-to-video generation…

38
arXiv — Machine Learning research 1mo ago

Rotation-Preserving Supervised Fine-Tuning

arXiv:2605.10973v1 Announce Type: new Abstract: Supervised fine-tuning (SFT) improves in-domain performance but can degrade out-of-domain (OOD) generalization. Prior work suggests that this degradation is related to changes in dominant singular subspaces of pretrained weight…

22
arXiv — Machine Learning research 1mo ago

Behavioral Mode Discovery for Fine-tuning Multimodal Generative Policies

arXiv:2605.11387v1 Announce Type: new Abstract: We address the problem of fine-tuning pre-trained generative policies with reinforcement learning (RL) while preserving the multimodality of their action distributions. Existing methods for RL fine-tuning of generative policies…

17
