Tag

Inference

340 articles archived under #inference

arXiv — Machine Learning research 25d ago

DP-MacAdam: Differentially Private Mechanism with Adaptive Clipping and Adaptive Momentum

arXiv:2606.05435v1 Announce Type: new Abstract: Differentially private stochastic gradient descent (DP-SGD) has become the standard framework for privacy-preserving machine learning, yet its reliance on a fixed gradient clipping threshold to limit sensitivity remains a…

12
arXiv — NLP / Computation & Language research 25d ago

InfoShield: Privacy-Preserving Speech Representations for Mental Health Screening via Information-Theoretic Optimization

arXiv:2606.05561v1 Announce Type: new Abstract: Speech-based mental health screening offers scalable depression detection, yet clinical deployment faces a significant barrier: users' privacy concerns about demographic information exposure. Current techniques struggle to resolve…

34
r/LocalLLaMA community 25d ago

Finally finished my LLM server: EPYC 9575F, 4× RTX 3090 (96GB VRAM), 768GB ECC RAM

Took a while, but Nalthis is finally up and assembled. Specs: Supermicro H13SSL-N; AMD EPYC 9575F (64C/128T, Zen 5); 768GB DDR5-5600 ECC RDIMM; 4× RTX 3090 (96GB VRAM total); 1× 2TB NVMe (OS); 2× 3.94TB NVMe (data); 2050W ATX 3.1 PSU; Corsair 9000D. Planned use: vLLM - high throughput small…

11
r/LocalLLaMA community 25d ago

I accidentally crippled my 4x RTX 3090 LLM rig with a hidden PCIe 2.0 x4 slot and fixing it doubled Mistral 128B performance

I’m posting this as a warning for anyone building multi-GPU local LLM rigs with older workstation/HEDT boards. My setup (Node #04): Gigabyte X399 Designare EX; Threadripper 1950X; 128GB DDR4; 4x RTX 3090; 10GbE TP-Link/Aquantia NIC; llama.cpp NCCL build; vLLM for safetensors models. I…

15
r/LocalLLaMA community 25d ago

KVarN: new KV-cache quant from Huawei. 3–5× KV cache compression with actual speed-up instead of slow-down, and unlike TurboQuant it holds up on reasoning (Apache 2.0, vLLM single flag)

The KV-cache quant race just got more interesting. Huawei just open-sourced KVarN, a KV-cache quantization method under Apache 2.0 that drops into vLLM with one flag. Posting because the tradeoff it's claiming is genuinely different from what's already in the stack, and I'd like to…

20
Hugging Face Daily Papers research 25d ago

Deep Embedded Multiplicative DMD for Algebra-Preserving Koopman Learning

Abstract DeepMDMD combines deep learning with Koopman theory to learn latent coordinates while enforcing algebraic constraints, enabling stable forecasting and coherent structure preservation in complex dynamical systems. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Koopman…

32
r/LocalLLaMA community 26d ago

MTP has no impact on my Qwen3.6 MoE performance

Hello, I have an RTX 5060 Ti and I tried running unsloth's Qwen3.6-35B GGUF with MTP. However, in both cases I get around 60 tok/s. Here are my flags: llama-server -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_M --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00 --alias unsloth/Qwen3.6…

35
arXiv — Machine Learning research 26d ago

Bayes-Sufficient Representations in Supervised Learning

arXiv:2606.04045v1 Announce Type: new Abstract: Representation learning is often described as preserving the information in an input that is relevant for prediction. This work asks what relevance means for a fixed supervised decision problem. A representation is defined to be…

14
arXiv — Machine Learning research 26d ago

Recover-LoRA for Aggressive Quantization: Reclaiming Accuracy in 2-Bit Language Models via Low-Rank Adaptation with Knowledge Distillation on Synthetic Data

arXiv:2606.04238v1 Announce Type: new Abstract: Aggressive weight quantization to 2-bit precision offers substantial throughput and memory gains for large language model (LLM) inference, but typically incurs severe accuracy degradation. These gains are particularly relevant for…

22
arXiv — Machine Learning research 26d ago

Federated Learning for Multi-Center Sepsis Early Prediction with Privacy-Preserving

arXiv:2606.04338v1 Announce Type: new Abstract: Privacy-sensitive and distributed characteristics of multi-center medical data bring severe obstacles to centralized modeling for accurate early prediction of sepsis. Federated learning (FL) has attracted growing attention as a…

6
arXiv — Machine Learning research 26d ago

Revisiting Privacy Amplification by Subsampling in Selective Release DPSGD

arXiv:2606.04384v1 Announce Type: new Abstract: Machine learning's reliance on sensitive data necessitates privacy-preserving techniques like Differentially Private Stochastic Gradient Descent (DPSGD). However, DPSGD suffers from substantial utility degradation and slow…

28
arXiv — NLP / Computation & Language research 26d ago

SANE Schema-aware Natural-language Evaluation of Biological Data

arXiv:2606.04500v1 Announce Type: new Abstract: High-throughput microscopy generates large, structured datasets capturing cellular responses to pharmacological perturbations, but accessing these datasets typically requires SQL expertise. Large language models offer a…

23
arXiv — NLP / Computation & Language research 26d ago

QO-Bench: Diagnosing Query-Operator-Preserving Retrieval over Typed Event Tuples

arXiv:2606.04646v1 Announce Type: new Abstract: Many real-world questions over business, legal, and scientific corpora are natural-language versions of database-style queries over records latent in text. Existing retrieval-augmented generation (RAG) systems are optimized…

6
Hugging Face Daily Papers research 26d ago

KletterMix: Climbing Toward High-Quality German Pretraining Data

Abstract A high-quality German-language corpus for language model pretraining is introduced through careful translation of an English corpus while preserving document structure and metadata, demonstrating improved downstream performance in German-language tasks. Generated by…

28
r/LocalLLaMA community 27d ago

Another shout out to llama.cpp build b9455 2x3090

As you guys know, the next highest quant is Unsloth's /Qwen3.6-27B-UD-Q8_K_XL.gguf. With llama.cpp before, I was getting 30-50 tk/s. vLLM was kicking llama's ass…

4
arXiv — Machine Learning research 27d ago

Geometry-Aware Tabular Diffusion

arXiv:2606.02607v1 Announce Type: new Abstract: Tabular synthesis is critical for privacy-preserving sharing and augmentation, yet diffusion models rely on implicit mechanisms to capture inter-column relationships. We introduce Geometry-Aware Tabular Diffusion (GATD), which…

32
arXiv — Machine Learning research 27d ago

Fast Unlearning at Scale via Margin Self-Correction

arXiv:2606.02920v1 Announce Type: new Abstract: Language-model unlearning updates a trained model to behave as if it had not seen selected training examples, while preserving utility and avoiding costly retraining. Existing approaches typically fine-tune the pretrained model…

26
arXiv — Machine Learning research 27d ago

ASymPO: Asymmetric-Scale Policy Optimization for Asynchronous LLM Post-Training Without Behavior Information

arXiv:2606.03070v1 Announce Type: new Abstract: Asynchronous reinforcement learning can improve language-model post-training throughput by decoupling response generation from policy optimization, but stale responses introduce distribution drift. Standard behavior-corrected…

17
arXiv — NLP / Computation & Language research 27d ago

Selective Token-Level Cryptographic Redaction for Privacy-Preserving Clinical Deployment of Large Language Models

arXiv:2606.03399v1 Announce Type: new Abstract: While large language models (LLMs) are increasingly used for clinical applications, many existing pipelines require sending raw sensitive health information to remote servers for processing, which heightens the risk of privacy…

4
arXiv — Machine Learning research 28d ago

Foundation-Preserving Adaptation via Generalized Rayleigh-Quotient Optimization

arXiv:2606.00132v1 Announce Type: new Abstract: While finetuning effectively adapts foundation models to specialized downstream tasks, it can degrade nontarget capabilities acquired during pretraining. Existing forgetting aware methods typically seek safer updates through…

8
arXiv — Machine Learning research 28d ago

Multi-Objective Reference-Aligned Machine Unlearning

arXiv:2606.00399v1 Announce Type: new Abstract: Machine unlearning aims to remove the influence of specific training samples while preserving the model's utility. Existing single-objective approaches, such as gradient ascent or random relabeling, often induce catastrophic…

28
arXiv — Machine Learning research 28d ago

EST-PRM: Stress-Testing Process Reward Models Before They Become Load-Bearing

arXiv:2606.00437v1 Announce Type: new Abstract: Process reward models (PRMs) are widely used in language-model training with dense step-level supervision. They assume PRM scores are stable proxies for step correctness under label-preserving transformations. These transformations…

19
arXiv — NLP / Computation & Language research 28d ago

How Far Do Auto-Interpretation Labels Generalize: A Controlled Study Across Languages, Scripts, and Rewordings

arXiv:2606.00356v1 Announce Type: new Abstract: Sparse autoencoder (SAE) features are increasingly used to interpret language models, with auto-generated natural-language labels serving as the primary interface for understanding what each feature represents. We ask whether these…

4
arXiv — NLP / Computation & Language research 28d ago

WaveFilter: Enhancing the Long-Context Capability of Diffusion LLMs via Wavelet-Guided KV Cache Filtering

arXiv:2606.00724v1 Announce Type: new Abstract: Diffusion Large Language Models (DLMs) have demonstrated significant advantages across various tasks. However, constrained by their multi-step iterative inference mechanism, their computational overhead and inference latency in…

28
arXiv — NLP / Computation & Language research 28d ago

Efficient RAG with Intent-Aware Retrieval and Semantics-Preserving Chunking

arXiv:2606.01240v1 Announce Type: new Abstract: The demand for powerful instruction following and reasoning capability of large language models (LLMs) has promoted rapid development of retrieval-augmented generation (RAG). The RAG system assists LLM generation by retrieving…

36
Hugging Face Daily Papers research 28d ago

VideoMLA: Low-Rank Latent KV Cache for Minute-Scale Autoregressive Video Diffusion

Abstract VideoMLA reduces memory usage in video diffusion models by replacing per-head keys and values with shared low-rank content and decoupled 3D-RoPE positional keys, maintaining quality while achieving significant compression and improved throughput. AI-generated summary…

19
llama.cpp releases dev-tools 28d ago

b9460

llama: limit max outputs of llama_context (#23861). llama: save more VRAM by reserving n_outputs == n_seqs when possible; add n_outputs_per_seq; move n_outputs_max to server-context; change ubatch to batch everywhere. macOS/iOS: macOS Apple Silicon (arm64) macOS Apple Silicon…

15
r/LocalLLaMA community 28d ago

For Ling-2.6-1T, what would make the size feel justified first: quality per token, local serving reality, or long context stability?

The first question I have about Ling-2.6-1T is not “is the model card impressive?” It is whether the boring trade-off makes sense. It is an open-sourced Ant/InclusionAI flagship with about 1T total params / 63B activated params, up to 1M native context, and 256K currently…

21
r/LocalLLaMA community 28d ago

mistral.rs v0.8.2: up to 2.8x faster CUDA inference than llama.cpp on GB10, B200, and H100

Hey all! I’ve been working on CUDA performance in mistral.rs, and v0.8.2 is focused on CUDA throughput. The result: on Gemma 4 (dense & MoE), mistral.rs is faster than llama.cpp at every point in my release sweep on GB10/H100/B200. See some results below on GB10 and B200:…

24
arXiv — Machine Learning research 29d ago

ScaleMAP: Preserving Local Density and Neighborhood Structure in Low-Dimensional Embeddings

arXiv:2605.30597v1 Announce Type: new Abstract: Nonlinear dimensionality-reduction methods such as UMAP and PaCMAP adaptively normalize local distances during graph construction, erasing neighborhood scale from the data. This distorts more than relative cluster sizes: sparse…

9
arXiv — Machine Learning research 29d ago

The Fast Mixing Mechanism for Differential Privacy

arXiv:2605.30600v1 Announce Type: new Abstract: Randomized sketching is a central tool for compressing large-scale optimization problems while preserving accuracy. In particular, sketches that are based on structured matrices, such as the Hadamard matrix, can be applied…

28
arXiv — Machine Learning research 29d ago

Unlearning in Diffusion Models: A Unified Framework with KL Divergence and Likelihood Constraints

arXiv:2605.30825v1 Announce Type: new Abstract: Unlearning in diffusion models aims to remove undesirable data or concepts while preserving the utility of pretrained models -- two fundamentally conflicting objectives. We propose a principled constrained optimization framework…

21
arXiv — Machine Learning research 29d ago

Federated Variational Preference Alignment with Gumbel-Softmax Prior for Personalized User Preferences

arXiv:2605.30873v1 Announce Type: new Abstract: Federated Learning (FL) offers a privacy-preserving pathway for aligning Large Language Models (LLMs); however, existing frameworks typically enforce a monolithic reward model, inevitably averaging out inherently conflicting user…

35
arXiv — Machine Learning research 29d ago

An Efficient and Scalable Graph Condensation with Structure-Preserving

arXiv:2605.31016v1 Announce Type: new Abstract: Graph condensation (GC) is pivotal for enabling Graph Neural Networks (GNNs) deployment in resource-constrained scenarios by compressing large-scale graphs into compact synthetic counterparts. Existing GC methods commonly suffer…

21
r/MachineLearning community 29d ago

I built mlx-Chronos — a community benchmark leaderboard for local LLM engines on Apple Silicon (oMLX, Rapid-MLX, mlx-lm, Ollama) [P]

Hey! I'm a CS student and I got tired of not being able to compare MLX inference engines properly — every benchmark out there is either made by the engine's own developers, runs on an M3 Ultra nobody has, or just shows tok/s with zero context. So I built mlx-Chronos — a small…

11
r/LocalLLaMA community 1mo ago

125 tok/s for Qwen3.6 q4xl on 2x 4060ti is insane perf/dollar

Under $1000 for 32GB VRAM from 2023, and ~300 watts draw... and this thing is outperforming the latest pick-your-vendor $5k mini PCs from 2026. So the next question is whether I can make it squeeze 150 t/s with the same q4xl on CUDA 13.3 this weekend. Anyone try it yet?

13
r/LocalLLaMA community 1mo ago

MINISFORUM UM790 Pro

Hi, has anyone tried this mini PC with llama.cpp or vLLM? This is what I have seen: "Budget and Compact Hardware MINISFORUM UM790 Pro ($351) is perhaps the most striking data point in the current local AI landscape." Is it true?

19
NVIDIA Developer Blog official-blog 1mo ago

DynoSim: Simulating the Pareto Frontier

Modern LLM serving is hard to tune because each deployment is a stack of interacting choices: model backend, tensor-parallel shape, prefill/decode split, worker...

22
r/LocalLLaMA community 1mo ago

Step-3.7-Flash-NVFP4 thinking for many minutes

Anyone else seeing Step-3.7-Flash-NVFP4 thinking for many minutes? I'm using it with Cline and can see it thinking for as long as 14 minutes in some cases, with vLLM reporting generation of 90 tokens/s every 10s.

19
r/LocalLLaMA community 1mo ago

I tested MTP on vLLM and llama.cpp for Gemma 4 & Qwen 3.6 — 3.34x faster inference, here are my findings RTX 6000 PRO.

Hey guys, I spent the last few weeks benchmarking Multi-Token Prediction (MTP) on Gemma 4 31B and Qwen 3.6 27B locally (GGUF, FP8) using both vLLM and llama.cpp. MTP is the inference trick every major lab is quietly adding to their stack right now and the results genuinely…

19
r/LocalLLaMA community 1mo ago

vLLM PR adding native HIP W4A16 kernel was merged

The performance increase introduced by the PR is awesome. Makes my ROCm rig a lot more useful. Numbers from the PR:
Kernel             dtype  max-num-seqs=8  max-num-seqs=32
Triton W4A16       bf16   82.4 tk/s       -
Triton W4A16       fp16   83.2 tk/s       -
ExLlama (no bf16)  fp16   255.0 tk/s      382.5 tk/s
RDNA3 W4A16…

27
r/LocalLLaMA community 1mo ago

Step 3.7 Flash Config + Early Data on 2x RTX 6000's

Set up Step 3.7 Flash on two Blackwell RTX Pro 6000s, got it running, and recorded the configs and settings as well as early data and readings like tokens per second on general inference. Running extended bench tests now; just wanted to get this to folks early. It's past…

21
arXiv — NLP / Computation & Language research 1mo ago

Text-Preserving Lossy Text Compression: A Study of Strategic Deletion and LLM Reconstruction

arXiv:2605.29000v1 Announce Type: new Abstract: Traditional lossless text compression preserves every byte, but its gains on natural language are often modest in realistic operating regimes. We study lossy semantic text compression, where the encoder strategically deletes…

27
arXiv — NLP / Computation & Language research 1mo ago

BrahmicTokenizer-131K: An Indic-Capable Drop-In Replacement for o200k_base

arXiv:2605.29379v1 Announce Type: new Abstract: We present BrahmicTokenizer-131K, a 131,072-vocabulary byte-level BPE tokenizer that closes the Brahmic compression gap at the 131K-vocabulary class while preserving the English, EU-language, and code compression of OpenAI's…

38
arXiv — NLP / Computation & Language research 1mo ago

From Blind Guess to Informed Judgment: Teaching LLMs to Evaluate Materials by Building Knowledge-Augmented Preference Signals

arXiv:2605.29555v1 Announce Type: new Abstract: As candidate generation and high-throughput experimentation advance, the primary bottleneck in materials discovery is shifting from property prediction to making reliable evaluations among massive candidate sets. We propose a…

31
Hugging Face Daily Papers research 1mo ago

When Should Models Change Their Minds? Contextual Belief Management in Large Language Models

Abstract Language models struggle with managing long-term information through contextual belief management, which involves updating, preserving, and filtering relevant information, and can be improved using reinforcement learning and representation-level steering techniques.…

14
r/LocalLLaMA community 1mo ago

Claude cli >= 2.1.154 breaks local use with vLLM by introducing "ctx", "msg" and "system" roles for API messages. This 1-line patch to vLLM fixes it.

diff --git a/vllm/entrypoints/anthropic/protocol.py b/vllm/entrypoints/anthropic/protocol.py
index 3ebc17117..2d5726d73 100644
--- a/vllm/entrypoints/anthropic/protocol.py
+++ b/vllm/entrypoints/anthropic/protocol.py
@@ -65,7 +65,7 @@ class AnthropicContentBlock(BaseModel):…

29
r/LocalLLaMA community 1mo ago

VLLM gives 5x speed of llama but quants not available (unsloth/gguf). What to do?

EDIT - IGNORE. I MADE A MISTAKE. The "better" model was 27b dense, not 35ba3b. Which also proves that 35b is not the best for coding-related tasks. With 27b fp8 on vLLM, the prefill speed is around 1500 tokens/sec and token gen is around 25 tokens/sec. I'll need to run llama again…

37
r/LocalLLaMA community 1mo ago

Question: Llama cpp, whats good right now for: MTP, KV cache quant, Long context.

Used the vLLM version of https://github.com/noonghunna/club-3090 and it worked fine for maybe 20-40k context; haven't tried the new one. Has anyone used the new llama.cpp patched one for a single 3090? The project is starting to seem very bloated, at least readme-wise. I use…

6
arXiv — Machine Learning research 1mo ago

Metric-Aware PCA as a Linear Instance of Geometric Deep Learning

arXiv:2605.27456v1 Announce Type: new Abstract: Geometric deep learning organises neural architectures around the symmetries of their data domain, with the choice of symmetry group serving as a geometric prior that determines what representations can be learned. Metric-Aware…

23
