Tag

Inference

340 articles archived under #inference · RSS

r/MachineLearning community 9d ago

An open handbook on LLM inference at scale (GPU internals, KV cache, batching, vLLM/SGLang/TensorRT-LLM) [P]

I've been working through the internals of LLM inference and writing up what I learn as an open, in-progress handbook. Just wrapped another chapter on GPU execution and memory internals: why a GPU sits mostly idle during inference, how the memory hierarchy gates throughput, and…

13
r/LocalLLaMA community 10d ago

$1800 in GPU cost: running Qwen/Qwen3.6-27b-FP8 with P2P, 262K context and BF16 KV cache at 55 tok/s

Hey peeps, wanted to share what is possible for folks with an inference-only, single-user use case with $1700 in GPU cost. Setup: 4x 5060 Ti (16GB) with P2P. If you are in the US and you keep an eye on Facebook Marketplace and places like Slickdeals, you can find some 5060 Ti 16 GB…

30
Hugging Face Daily Papers research 10d ago

Duration Aware Scheduling for ASR Serving Under Workload Drift

Abstract Duration-aware scheduling policies improve ASR serving latency by leveraging audio length as a predictor for processing time, with SJF and HRRN algorithms showing significant median latency reductions while maintaining throughput. Generated by…

26
Hugging Face Daily Papers research 10d ago

Holo-World: Unified Camera, Object and Weather Control for Video World Model

Abstract A unified controllable video world model generates videos from a single image while preserving scene structure and transferring to target weather states through specialized parameterization and conditioning techniques. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Video…

22
arXiv — Machine Learning research 11d ago

A Hybrid GNN-FEM Framework for Phase-Field Fracture Simulation. Physics-Preserving Hybridization for Generalizable Surrogate Modeling

arXiv:2606.19378v1 Announce Type: new Abstract: Scientific machine learning (SciML) has emerged as a promising approach for accelerating simulations of complex physical systems, yet achieving physically consistent and generalizable predictions for nonlinear, history-dependent…

28
arXiv — Machine Learning research 11d ago

LOKI: Memory-Free Null-Space Constrained Lifelong Knowledge Editing

arXiv:2606.19679v1 Announce Type: new Abstract: Lifelong knowledge editing aims to efficiently and sequentially update language models over time, as new knowledge becomes available or when the model makes mistakes, while preserving acceptable performance on past knowledge. One…

31
arXiv — Machine Learning research 11d ago

An Information Theoretic Framework for Graph Novelty Generation via Latent Mixture Modeling

arXiv:2606.19770v1 Announce Type: new Abstract: We propose an information-theoretic framework for graph novelty generation, which aims to generate data that are distinct from existing patterns while preserving global structural consistency. Our approach embeds data into a latent…

32
arXiv — Machine Learning research 11d ago

Activation- and Influence-Aware Ranks (AIR): Function-Preserving SVD Compression for LLMs

arXiv:2606.19993v1 Announce Type: new Abstract: We present Activation- and Influence-Aware Ranks (AIR), an SVD-based LLM compression framework that guides each weight matrix's low-rank approximation with a backward-signal influence metric. Starting from the activation-aware…

38
arXiv — NLP / Computation & Language research 11d ago

CacheWeaver: Cache-Aware Evidence Ordering for Efficient Grounded RAG Inference

arXiv:2606.19667v1 Announce Type: new Abstract: Retrieval-Augmented Generation (RAG) improves factual grounding, but it also lengthens prompts and raises prefill cost. Prefix caching in serving engines such as vLLM reduces this cost only when requests share the same token…

15
arXiv — NLP / Computation & Language research 11d ago

Closing the Calibration Gap in Semantic Caching

arXiv:2606.19719v1 Announce Type: cross Abstract: Semantic caching cuts LLM inference costs by serving a cached response to semantically similar queries. Standard practice evaluates these systems using PR-AUC, a metric that only measures how well scores rank and ignores whether…

26
arXiv — NLP / Computation & Language research 11d ago

Think Again or Think Longer? Selective Verification for Budget-Aware Reasoning

arXiv:2606.19808v1 Announce Type: cross Abstract: Test-time reasoning is increasingly used as a serving-time control knob, but extra reasoning is not uniformly valuable: it can repair failed attempts, waste compute on already-correct answers, or introduce harmful answer changes.…

25
r/LocalLLaMA community 11d ago

GLM-5.2 (744B, 2-bit) at 7.3 tok/s on 4×3090 + 192GB — and why IQ1_M wasn't any faster

TLDR: For the first time, I feel relief that they could shut down the cloud services and I would be OK. I got my 4th 3090 and then unsloth dropped the Q2 and Q1. I wrote nothing else here; it's from CC, so it might be wrong. GLM-5.2 UD-IQ2_M runs across 4×3090 + RAM expert offload…

7
r/LocalLLaMA community 11d ago

DiffusionGemma 26b on a 4090 at up to 475t/s... and some thoughts...

Figured I'd post up a bit of info for anyone else who was thinking about messing with this model on a 3090/4090. Obviously I can't use the nvfp4, but I got it up and running in vLLM using diffusiongemma-26B-A4B-it-AWQ-INT4. Had to run it in a custom vLLM docker they provide for…

34
r/MachineLearning community 11d ago

Fearless Concurrency on the GPU: Safe GPU inference in Rust, competitive with vLLM/SGLang [R]

I maintain cuTile Rust and just posted the paper "Fearless Concurrency on the GPU." As more GPU code gets AI-generated, the bottleneck moves from writing it to trusting it. cuTile Rust lets you write or generate GPU kernels whose memory safety and data-race freedom are verified…

29
r/LocalLLaMA community 11d ago

NVFP4 kv cache quantization on sm120 will make 32GB VRAM systems very capable

The best I can get from Qwen3.6-27B on my 32GB VRAM (2 x 5060) is ~60 tok/sec gen speed at context size 196608 (sakamakismile text nvfp4), with FP8 KV quantization. NVFP4 KV cache quantization can't get here fast enough. Reminds me of the time there was this game I couldn't play on…

38
arXiv — Machine Learning research 12d ago

SAGE: Retain-Aware Post-Hoc Sanitization of Final Unlearning Vector

arXiv:2606.18309v1 Announce Type: new Abstract: Large Language Model (LLM) unlearning aims to remove undesirable knowledge or behaviors while preserving retained capabilities. Current unlearning methods all involve a trade-off between unlearning and retention. We have found that…

24
arXiv — Machine Learning research 12d ago

SCOPE-FL: A Strategy-proof Chain-based Optimal pareto efficient Federated Learning System

arXiv:2606.18384v1 Announce Type: new Abstract: Hierarchical Federated Learning (HFL) enables scalable collaborative model training across distributed devices while preserving data privacy. However, existing HFL client selection mechanisms suffer from a fundamental strategic…

31
arXiv — Machine Learning research 12d ago

Beyond Prediction: Tail-Aware Scheduling for LLM Inference

arXiv:2606.18431v1 Announce Type: new Abstract: LLM serving exhibits extreme length variability, making size-based scheduling difficult in practice. Recent LLM schedulers approximate SJF/SRPT using predicted decode lengths or ranks and primarily report mean-centric metrics such…

13
arXiv — Machine Learning research 12d ago

PSyGenTAB: A Privacy-Preserving Framework for Synthetic Clinical Tabular Data Generation via Constrained Optimization

arXiv:2606.18518v1 Announce Type: new Abstract: The development of medical AI is constrained by limited access to high-quality clinical data due to institutional silos and strict privacy regulations such as HIPAA and GDPR. Synthetic data generation offers a potential solution,…

4
arXiv — Machine Learning research 12d ago

Do as the Romans Do: Learning Universal Behaviors from Heterogeneous Agents

arXiv:2606.18537v1 Announce Type: new Abstract: Humans often acquire new skills by observing others, since observed behaviors implicitly reveal how to act in an environment. However, observations drawn from a heterogeneous population introduce conflicting behavioral signals,…

18
arXiv — Machine Learning research 12d ago

PACT: Preserving Anchored Cores in Task-vectors for Model Merging

arXiv:2606.18627v1 Announce Type: new Abstract: Model merging has emerged as a training-free alternative to multi-task learning, aiming to combine multiple task-specific fine-tuned models into a single multi-task model. Most existing model merging approaches follow the Task…

30
arXiv — NLP / Computation & Language research 12d ago

PreUnlearn: Auditing Collateral Knowledge Damage Before Large Language Model Unlearning

arXiv:2606.18473v1 Announce Type: new Abstract: Machine unlearning for large language models (LLMs) aims to remove specified knowledge while preserving the rest of the model's capabilities. However, the boundary between knowledge to forget and knowledge to retain is often…

15
r/LocalLLaMA community 12d ago

Gemma 4 E2B running in-browser at 255 tok/s using WebGPU kernels written by Fable 5

Before Fable 5 was shut down, it helped us optimize our Gemma 4 WebGPU kernels, reaching around 255 tokens per second on my M4 Max. Today, we're releasing the demo and kernels for you to try out yourself. Hope you find it interesting! Links: - Demo (+ kernels):…

9
arXiv — Machine Learning research 13d ago

Performance-Driven Environment Abstraction with Multi-Timescale Learning

arXiv:2606.17377v1 Announce Type: new Abstract: We study performance-driven environment abstraction for decision-making in large Markov decision processes. Rather than preserving geometric or topological structure, we seek abstractions that directly optimize decision quality. We…

8
Hugging Face Daily Papers research 13d ago

Memento: Reconstruct to Remember for Consistent Long Video Generation

Abstract Memento is a subject-reconstruction-guided framework that improves long-form video generation by preserving recurring subjects through memory-based reconstruction and dual-query mechanisms. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Long-form video generation requires…

17
Hugging Face Daily Papers research 14d ago

Tangram: Unlocking Non-Uniform KV Cache Compression for Efficient Multi-turn LLM Serving

Abstract Multi-turn large language model serving faces memory constraints due to growing key-value cache, but a structured approach to non-uniform compression enables significant throughput improvements through static budget allocation and optimized memory management. Generated…

14
arXiv — Machine Learning research 14d ago

Repeated Bilateral Trade: The Quest for Fairness

arXiv:2606.15369v1 Announce Type: new Abstract: We study repeated bilateral trade from a fairness perspective. At each round, a fresh seller-buyer pair arrives, and the platform posts a price before observing the traders' valuations. Trade occurs only if both agents accept the…

34
arXiv — Machine Learning research 14d ago

InstantForget: Update-Free Backdoor Unlearning with Inference-Time Feature Reset

arXiv:2606.15730v1 Announce Type: new Abstract: Backdoor unlearning aims to remove a malicious trigger behavior from a deployed model while preserving clean utility. We study the update-free inference-time setting, where model parameters remain frozen. First, we audit a common…

35
arXiv — NLP / Computation & Language research 14d ago

Evaluating and Preserving Lexical Stress in English-to-Chinese Speech-to-Speech Translation

arXiv:2606.15266v1 Announce Type: new Abstract: Speech-to-speech translation (S2ST) systems have achieved impressive progress in semantic accuracy and speech naturalness. However, the cross-lingual transfer of lexical stress, a vital cue for emphasis and speaker intent, remains…

16
arXiv — NLP / Computation & Language research 14d ago

Replay What Matters: Off-Policy Replay for Efficient LLM Reinforcement Unlearning

arXiv:2606.15333v1 Announce Type: new Abstract: LLM unlearning has emerged as a cost-effective alternative to full retraining for removing hazardous knowledge from pretrained models while preserving general utility. Recent RL-based methods such as RULE reformulate unlearning as…

5
arXiv — NLP / Computation & Language research 14d ago

Privacy-Preserving Text Sanitization for Distributed Agents Collaboration via Disentangled Representations

arXiv:2606.15335v1 Announce Type: new Abstract: When distributed agents exchange text across organizational boundaries, privacy leakage arises not only from explicit identifiers but also from distributional signatures such as formatting conventions, vocabulary choices, and…

17
arXiv — NLP / Computation & Language research 14d ago

Vernier: Probing Representational Misalignment Behind Lexical Gaps in Causal Reasoning

arXiv:2606.15733v1 Announce Type: new Abstract: Instruction-tuned language models can answer the same causal-reasoning question differently after its English variable names are replaced by type-preserving placeholders, although the structural causal model and the gold answer are…

21
Hugging Face Daily Papers research 14d ago

Nemotron 3 Ultra: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning

Abstract Nemotron 3 Ultra is a large-scale language model featuring hybrid Mamba-Attention architecture with 550 billion parameters, achieving high inference throughput and extended context length through specialized training techniques. Generated by…

5
Hugging Face Daily Papers research 14d ago

VisualClaw: A Real-Time, Personalized Agent for the Physical World

Abstract VisualClaw is a self-evolving multimodal agent that reduces deployment costs through hybrid encoding and skill evolution while improving video-QA accuracy across multiple benchmarks. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Vision language models are serving as…

32
r/LocalLLaMA community 14d ago

vLLM has a new streaming parser for Qwen3+ available in nightly

The new parser reportedly fixes the issues many were seeing with Qwen3.6-27b stopping mid turn, as well as failing streaming tool calls due to chunk boundaries. The mid turn stopping is especially annoying when trying to use the model for agentic workflows. I've not seen it…

22
NVIDIA Developer Blog official-blog 14d ago

Boosting MoE Training Throughput with Advanced Fusion Kernels

Mixture-of-experts (MoE) models have quickly become a foundational component of modern, large-scale AI systems. They are widely adopted because they enable...

36
Hacker News — AI on Front Page community 14d ago

Ask HN: Has anyone replaced Claude/GPT with a local model for daily coding?

Has anyone here fully swapped Claude/GPT for a local model as their main coding tool, not just for side experiments? If so, please share your setup and performance (e.g. tok/s). Comments URL: https://news.ycombinator.com/item?id=48542100 Points: 510 # Comments: 255

23
r/LocalLLaMA community 14d ago

I got tired of juggling OpenRouter + Artificial Analysis + Design Arena tabs to pick a model, so I put them in one filterable table

So every time I pick a model for a feature or some random use case, I end up with like 12 tabs open: usually OpenRouter for price and context, Artificial Analysis for benchmarks, Design Arena for the UI/frontend Elo if that's relevant, a status/model page for throughput or…

34
r/LocalLLaMA community 14d ago

This is amazing. Token speed doubled + kv cache now need low vram - qwen 27b

"Qwen3.6-27B Q4_K_M on a single RTX 3090: native 256K context at 38.6 tok/s with 72 MiB of resident KV, needle recall 88-100% at 6% residency, harness accuracy unchanged (36/36 vs full cache)." On the same hardware, generation speeds doubled and VRAM usage dropped significantly…

22
arXiv — Machine Learning research 15d ago

A Longitudinal Attribute-Conditioned Neural Network for Modeling Health-State Transition Probabilities in Temporally Irregular Data: The LANTERN Framework

arXiv:2606.13880v1 Announce Type: new Abstract: Accurate estimation of long-term care transition probabilities is central to disability insurance pricing, reserving, and solvency assessment. Classical actuarial multi-state models commonly rely on Markov, semi-Markov, or…

32
arXiv — Machine Learning research 15d ago

When to Write and When to Suppress: Route-Specialized Dual Adapters for Memory-Assisted Knowledge Editing

arXiv:2606.14668v1 Announce Type: new Abstract: Knowledge editing systems must update selected facts while preserving nearby but irrelevant behavior. This paper studies this problem in a memory-assisted setting where an edit memory is retrieved at inference time and a…

36
arXiv — Machine Learning research 15d ago

LoMC: Localized Multidirectional Correction for Refusal Suppression in Routed Foundation Models

arXiv:2606.13709v1 Announce Type: cross Abstract: We study controlled post-training refusal suppression in routed MoE and hybrid-MoE foundation models, aiming to increase non-refusal target-response behavior while preserving general capability under a compact intervention…

23
r/LocalLLaMA community 15d ago

Voice-to-voice chatbot update

I've been working on this after hours for a few months, continuously improving it. It is now at a point where the chatbot is close to real-time (thanks to SSE streaming) and also interruptible while preserving context of what was last said. 100% local and powered by Qwen3.5-397B…

33
r/LocalLLaMA community 15d ago

Qwen 27B Q6/Q8 KV + MTP at 256K on DGX Spark / GB10, tok/s?

Has anyone tested Qwen3.6-27B on NVIDIA DGX Spark / GB10 or similar systems at 256K context? I know it's a dense model, but I'm curious how it performs with MTP enabled. Looking for real numbers with: Q6/Q8 quant, Q8 KV cache, MTP/speculative decoding, 256K context. Mainly…

31
r/LocalLLaMA community 15d ago

Xiaomi is now serving MiMo V2.5 at 1000-3000tps using DFlash & Persistent kernel. DFLash model is out, open-source release promised coming soon

https://mimo.xiaomi.com/blog/mimo-tilert-1000tps submitted by /u/Dany0

20
r/LocalLLaMA community 16d ago

Yay got Gemma 12B QAT working on old 1080ti (maybe with speculative decoding?)

Pretty happy with 50 tok/sec on this 9 year old GPU. Suggestions to improve anything (speed or quality) very welcome! I'm not 100% sure how to tell if the speculative decoding "model-draft" is helping or not. But hey, it is fast and seems coherent, I'm happy bash…

24
r/LocalLLaMA community 16d ago

RTX 5080 + RTX 3090 Setup: 80+ Tok/s on Qwen 3.6 27B Q8

submitted by /u/SirReal14

22
r/LocalLLaMA community 16d ago

GLM 5.2 is out - open weights to be released next week. How did it do on my one-shot Pac-Man test?

Quick initial impressions: at 70 tok/s, slower than GLM 5.1; seems to spend more time reasoning; better results with my Pac-Man test. The one-shot result is almost functional; apart from the ghosts getting stuck immediately after leaving the ghost house, I did not notice any…

14
Hacker News — AI on Front Page community 16d ago

RTX 5080 and RTX 3090 Setup: 80 Tok/s on Qwen 3.6 27B Q8

Article URL: https://imil.net/blog/posts/2026/rtx-5080-+-rtx-3090-setup-80+-tok-s-on-qwen-3.6-27b-q8/ Comments URL: https://news.ycombinator.com/item?id=48515454 Points: 228 # Comments: 76

5
r/LocalLLaMA community 17d ago

4× RTX PRO 6000 Blackwell on Water, and the One Card That Wouldn't Behave

Converting four RTX PRO 6000 Blackwell cards to waterblocks, finding a VRM choke loose on the workbench, and getting back to 41k tok/s. submitted by /u/thekalki

24
