Tag

Inference

340 articles archived under #inference

The Information — AI news-outlet 17d ago

Inside Tech’s Feverish Demand for Retatrutide, a Supposed Super Peptide

For more than a decade, Dr. Molly Maloof has had a front-row seat to Silicon Valley’s ever-evolving health obsessions as a physician and founder of M3 Healthspan, a San Francisco–based concierge medical practice serving the tech elite. Lately, those conversations have…

25
Hugging Face Daily Papers research 17d ago

From 2D Grids to 1D Tokens: Reforming Shared Representations for Multimodal Image Fusion

Abstract A multimodal image fusion approach uses a 1D token interface from a pretrained image tokenizer to enhance global appearance coherence while preserving local details through selective token editing. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Multimodal image fusion…

33
arXiv — NLP / Computation & Language research 18d ago

MiniPIC: Flexible Position-Independent Caching in <100LOC

arXiv:2606.13126v1 Announce Type: cross Abstract: Retrieval-augmented and agentic workloads repeatedly prefill recurring predictable structured inputs (which we call "spans") such as documents and code files. Yet, prefix caching in engines such as vLLM cannot reuse their KV…

12
arXiv — NLP / Computation & Language research 18d ago

Examining the Cognitive Gap Between Authors and Peer Reviewers on Academic Paper Novelty

arXiv:2606.13452v1 Announce Type: cross Abstract: Novelty is a crucial metric for assessing the quality of academic papers. Scholars strive to highlight the novel aspects of their work, particularly in the title, abstract, and introduction. Peer review, serving as the gatekeeper…

18
Hugging Face Daily Papers research 18d ago

ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction

Abstract ReVision improves computer-use agent efficiency by removing redundant visual patches from consecutive screenshots while preserving spatial structure, reducing token usage by 46% and improving success rates. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Computer-use…

10
Hugging Face Daily Papers research 18d ago

Breaking the Bubble: Asynchronous Pipeline Parallel Training with Bounded Weight Inconsistency

Abstract PACI enables efficient asynchronous pipeline training by controlling forward/backward weight inconsistency through local gradient accumulation, achieving higher throughput and faster training time-to-accuracy without sacrificing stability or memory usage. Generated by…

9
arXiv — Machine Learning research 19d ago

Federated continual learning: A comprehensive survey on lifelong and privacy-preserving learning over distributed and non-stationary data

arXiv:2606.11272v1 Announce Type: new Abstract: Federated Learning (FL) enables collaborative and privacy-preserving model training across distributed clients, but most existing FL systems implicitly assume data stationarity. In real-world settings, such as healthcare, industrial…

10
arXiv — Machine Learning research 19d ago

LSTM-Based Detection of Structural Breaks in Property Insurance Loss Reserving: A Climate-Informed Approach

arXiv:2606.11463v1 Announce Type: new Abstract: Accurate loss reserving is foundational to insurer solvency, yet accelerating climate-driven catastrophes systematically violate the stability assumptions on which traditional actuarial methods depend. This white paper presents a…

30
arXiv — Machine Learning research 19d ago

Structure-Preserving Neural Surrogates with Tractable Uncertainty Quantification

arXiv:2606.11650v1 Announce Type: new Abstract: Recent advances in scientific machine learning provide a means of near-real-time solution to partial differential equations (PDEs), but lack the theoretical underpinnings of conventional simulators that support contemporary…

4
arXiv — NLP / Computation & Language research 19d ago

External Experience Serving in Production LLM Systems: A Deployment-Oriented Study of Quality-Cost Trade-offs

arXiv:2606.11806v1 Announce Type: new Abstract: Production LLM systems accumulate reusable operational experience, but the practical deployment issue is not merely whether such experience can help. It is how different serving strategies trade off quality against online cost…

30
Hugging Face Daily Papers research 19d ago

Breaking Entropy Bounds: Accelerating RL Training via MTP with Rejection Sampling

Abstract Bebop addresses the efficiency bottleneck in reinforcement learning training of large language models by optimizing multi-token prediction techniques through entropy-aware sampling and novel training objectives that improve acceptance rates and inference throughput.…

28
r/LocalLLaMA community 19d ago

FlashMemory-DeepSeek-V4: Lightning Index Ultra-Long Context via Lookahead Sparse Attention

Conventional LLMs keep the full KV cache loaded during decoding, causing a severe GPU memory bottleneck for ultra-long context serving. In this report, we propose Lookahead Sparse Attention (LSA), a novel inference paradigm powered by a Neural Memory Indexer built upon the…

25
NVIDIA Developer Blog official-blog 19d ago

Run DiffusionGemma on NVIDIA for Developer-Ready, High-Throughput Text Generation

Developers building real-time AI—such as chat assistants, copilots, and agentic workflows—are often constrained by token-by-token generation speed. This...

6
Hugging Face Daily Papers research 19d ago

FadeMem: Distance-Aware Memory Consolidation for Autoregressive Video Diffusion

Abstract FadeMem introduces a distance-aware key-value memory consolidation mechanism that organizes historical video data into a temporal hierarchy, improving long-video generation by preserving recent context and long-range anchors under fixed cache constraints. Generated by…

36
r/LocalLLaMA community 19d ago

Qwen3.6-MTP-27B on Tesla V100 @ 55 TPS (llama.cpp) — Any way to push this higher without quality loss?

Hey everyone, I'm running Qwen3.6-MTP-27B-MTP (Q4_K_M) with llama.cpp server on a Tesla V100, and I'm currently getting around 55 tokens/sec. I'm trying to find out whether there are any configuration changes that could increase throughput further without reducing output…

31
arXiv — Machine Learning research 20d ago

SPACE: Source-free Proxy Anchor Concept Erasure for MLLMs

arXiv:2606.09868v1 Announce Type: new Abstract: As Multimodal Large Language Models (MLLMs) face growing privacy risks and regulatory constraints, machine unlearning (MU) has emerged as a crucial solution for removing sensitive data while preserving model performance. However,…

28
arXiv — Machine Learning research 20d ago

QSplitFL: Capability Aware Deep Q-Learning for Optimal Split Point Selection in Split Federated Learning

arXiv:2606.09869v1 Announce Type: new Abstract: Federated Learning (FL) combined with Split Learning (SL) is a privacy-preserving paradigm that enables training deep neural networks (DNNs) on resource-constrained devices while reducing overall training cost. However, determining…

22
arXiv — NLP / Computation & Language research 20d ago

Trainable Smooth-Rotation Transforms with Learned Channel Scales for LLM Quantization

arXiv:2606.09927v1 Announce Type: cross Abstract: Post-training quantization (PTQ) is one of the most practical ways to reduce the serving cost of Large Language Models (LLMs), but activation quantization remains difficult because outlier-dominated channels lead to large…

6
arXiv — Machine Learning research 20d ago

Privacy-Preserving Credit Risk Prediction with Alternative Data

arXiv:2606.10333v1 Announce Type: new Abstract: Credit risk prediction is a critical problem in the consumer credit industry. Traditionally, financial institutions construct credit risk prediction models using borrowers' demographic, financial, and credit history data,…

14
arXiv — NLP / Computation & Language research 20d ago

ConvMemory v2: A Recall-Preserving Top-10 Evidence Reranker for Conversational Memory Retrieval

arXiv:2606.10842v1 Announce Type: new Abstract: We describe ConvMemory v2, an opt-in token-evidence reranker that sits after the lightweight ConvMemory v1 reranker and reorders only v1's protected top-10 candidate set. v2 is a fine-tuned ms-marco-MiniLM-L-6-v2 cross-encoder…

31
arXiv — NLP / Computation & Language research 20d ago

Does Reasoning Preserve Alignment? On the Trustworthiness of Large Reasoning Models

arXiv:2606.11046v1 Announce Type: new Abstract: Instruction-tuned LLMs are increasingly converted into reasoning models through post-training to improve multi-step task performance. This conversion is usually optimized for reasoning accuracy, without explicitly preserving the…

18
NVIDIA Developer Blog official-blog 20d ago

Model Quantization: Turn FP8 Checkpoints into High-Performance Inference Engines with NVIDIA TensorRT

Converting a quantized checkpoint into an NVIDIA TensorRT engine bridges the gap between model optimization and production deployment, enabling faster...

6
Hugging Face Daily Papers research 20d ago

AsyncWebRL: Efficient Multi-Step RL for Visual Web Agents

Abstract AsyncWebRL improves vision-language web agent training through asynchronous reinforcement learning and trajectory normalization modifications, achieving faster throughput and better performance on challenging tasks. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Training…

32
r/MachineLearning community 20d ago

Are privacy-preserving techniques actually being used in production ML systems? [D]

I've been reading more about privacy-preserving ML approaches such as differential privacy, federated learning, and on-device inference. The research literature is fairly active, but I'm curious about real-world adoption. For those working in industry: Are these techniques being…

16
arXiv — Machine Learning research 21d ago

Enabling KV Caching of Shared Prefix for Diffusion Language Models

arXiv:2606.07571v1 Announce Type: new Abstract: Key-value (KV) caching for shared prefixes is essential for high-throughput large language model (LLM) serving, but it faces critical challenges in emerging diffusion language models (DLMs). In DLMs, bidirectional attention means…

11
arXiv — Machine Learning research 21d ago

Semantic Cache Distillation: Efficient State Transfer via Reuse and Selective Patching

arXiv:2606.07684v1 Announce Type: new Abstract: Disaggregated serving alleviates memory bottlenecks in Large Language Model (LLM) inference but creates a severe communication bottleneck: transmitting high-dimensional Key-Value (KV) caches often dominates time-to-first-token…

16
arXiv — Machine Learning research 21d ago

Breaking the Bubble: Asynchronous Pipeline Parallel Training with Bounded Weight Inconsistency

arXiv:2606.07881v1 Announce Type: new Abstract: Pipeline parallelism is essential for training large neural networks, but existing schedules trade off throughput, memory, and optimization consistency. Synchronous pipelines preserve forward/backward weight consistency but suffer…

5
Hugging Face Daily Papers research 21d ago

CoVEBench: Can Video Editing Models Handle Complex Instructions?

Abstract A new benchmark called CoVEBench is introduced to evaluate compositional video editing capabilities, addressing limitations of existing models in handling complex, multi-step editing tasks while preserving spatiotemporal content. Generated by…

19
NVIDIA Developer Blog official-blog 21d ago

Train Models Faster with JAX and MaxText Using NVFP4 on NVIDIA Blackwell

Pre-training frontier LLMs comes down to throughput. When training spans trillions of tokens across thousands of accelerators, every percentage point of step...

34
r/MachineLearning community 21d ago

Université Paris Saclay or TU Delft for Applied Mathematics Masters [R]

I've been admitted into both UPS and TUD for Applied Mathematics, and I wanted to hear some advice on which one would be better. For context, I'd like to work in some form of AI research, most likely within industry. At the moment, I'm most interested in privacy preserving…

8
Hacker News — AI on Front Page community 21d ago

MiMo-v2.5-Pro-UltraSpeed: 1T model with 1000 tokens per second

Article URL: https://mimo.xiaomi.com/blog/mimo-tilert-1000tps Comments URL: https://news.ycombinator.com/item?id=48446639 Points: 252 # Comments: 175

30
Hugging Face Daily Papers research 21d ago

Physics in 2-Steps: Locking Motion Priors Before Visual Refinement Erases Them

Abstract PhaseLock is a training-free framework that improves physical consistency in image-to-video diffusion models by preserving motion priors from early-step inference throughout the denoising process. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Image-to-Video diffusion…

17
arXiv — Machine Learning research 22d ago

Accelerating Reproducible Research in Synthetic EHR Generation

arXiv:2606.06990v1 Announce Type: new Abstract: The generation of high-fidelity synthetic Electronic Health Records (EHR) is crucial for advancing medical research while preserving patient privacy. However, head-to-head comparison of existing generative models is hindered by…

13
arXiv — Machine Learning research 22d ago

Structure-Preserving Correction Learning for Sparse Bayesian Inference in Brain Source Imaging

arXiv:2606.07196v1 Announce Type: new Abstract: Classical sparse Type-II Bayesian methods for M/EEG brain imaging support joint estimation of source and noise hyperparameters, but rely on fixed iterative update rules. Although these updates are principled and interpretable,…

28
arXiv — Machine Learning research 22d ago

Closed-Form Spectral Regularization for Multi-Task Model Merging

arXiv:2606.07289v1 Announce Type: new Abstract: Model merging combines several independently fine-tuned experts into a single multi-task model without any training data, reducing the storage, serving, and decentralized-development costs of large foundation models.…

38
arXiv — Machine Learning research 22d ago

Breaking the Ice: Analyzing Cold Start Latency in vLLM

arXiv:2606.07362v1 Announce Type: new Abstract: As scalable inference services become popular, the cold start latency of an inference engine becomes important. Today, vLLM has evolved into the de facto inference engine of choice for many inference workloads. Although popular,…

11
arXiv — Machine Learning research 22d ago

Reversible Foundations: Training a 120B Sparse MoE through State-Preserving Scaling

arXiv:2606.07404v1 Announce Type: new Abstract: This paper reports on training a hundred-billion-parameter sparse mixture of experts on a single eight-GPU node, end to end. LightningLM 0.1V is a recurrence-backbone language model family grown in four stages from a small dense…

28
arXiv — NLP / Computation & Language research 22d ago

KIT's Submission to Cross-Lingual Voice Cloning in IWSLT 2026

arXiv:2606.07240v1 Announce Type: new Abstract: Cross-lingual voice cloning aims to generate speech in a target language while preserving speaker identity from a source-language reference. This task is central to speech translation and is the focus of the IWSLT 2026…

21
arXiv — NLP / Computation & Language research 22d ago

MMAE: A Massive Multitask Audio Editing Benchmark

arXiv:2606.07229v1 Announce Type: cross Abstract: We introduce MMAE, a Massive Multitask Audio Editing benchmark, serving as the first comprehensive evaluation testbed designed for general-purpose instruction-based audio editing. Spurred by the shift toward intelligent creation,…

8
arXiv — NLP / Computation & Language research 22d ago

DirectAudioEdit: Inversion-Free Text-Guided Audio Editing via Diffusion Prediction Contrast

arXiv:2606.07356v1 Announce Type: cross Abstract: Text-guided audio editing aims to modify the language-specified acoustic content while preserving edit-irrelevant source components. Existing training-free methods typically rely on inversion-based editing. While inversion-free…

25
Hugging Face Daily Papers research 22d ago

SubtleMemory: A Benchmark for Fine-Grained Relational Memory Discrimination in Long-Horizon AI Agents

Abstract SubtleMemory benchmark evaluates AI agents' ability to handle complex relational memory structures that emerge during prolonged interactions, revealing limitations in current memory systems for preserving and utilizing nuanced memory relationships. Generated by…

33
r/LocalLLaMA community 23d ago

dvlt.cu: inference engine written from scratch in CUDA/C++ for NVIDIA's DVLT 3D transformer model

I'm into both HPC and 3D reconstruction, so I built this as a side project. dvlt.cu is a single 5MB binary: - No python, torch, TF, ONNX, llama.cpp, vLLM, or huggingface runtime - Nearly no dependencies: only cuBLASLt (shipped with libcuda) + cuTLASS (header-only lib) - mmap'd…

21
r/LocalLLaMA community 23d ago

120 tok/s on 12GB VRAM with Gemma 4 12B QAT MTP

Google just released the QAT (Quantization-Aware Training) variant of their Gemma 4 models, including 12B, so it was only natural for me to benchmark it on my 12GB GPU since it fits entirely in VRAM. I was pleasantly surprised with the result! By using llama.cpp patched with the…

17
r/LocalLLaMA community 23d ago

It felt good to return my Asus Spark

It's an incredible little package, but too expensive a price to pay for the performance, and I simply didn't want to be part of the great "Superchip lie" - it could be super, but it's super ruined by its limited memory bandwidth even though it *could* be 2x throughput - it…

31
r/LocalLLaMA community 23d ago

Serving TTS/cloning models on llama.cpp?

Are there any quality voice cloning and speech generation models that already have support in llama.cpp or, more likely, vLLM-Omni? It would be nice to swap them out like any other inference model and use a common API, rather than making a separate container or conda for each model I…

17
r/LocalLLaMA community 23d ago

Domino: Decoupling Causal Modeling from Autoregressive Drafting in Speculative Decoding

Up to 5.8x throughput speedup on Qwen3. Paper: https://arxiv.org/abs/2605.29707 Code: https://github.com/jianuo-huang/Domino Models: https://huggingface.co/Huang2020 Submitted by /u/pmttyji

6
Hugging Face Daily Papers research 24d ago

LLM Anonymization Against Agentic Re-Identification

Abstract AURA is an LLM-powered anonymization framework that balances privacy protection against agentic web-search re-identification while preserving contextual utility through adaptive privacy scopes and mask-reconstruct methods. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.)…

25
r/LocalLLaMA community 24d ago

Qwen 3.6-27B on vLLM with dual RTX 3090s: looking for launch parameters

Hi everyone. Please share your working launch commands for running Qwen 3.6-27B via vLLM on dual RTX 3090s (both running in PCIe 4.0 x8). I'm interested in setups both with and without an NVLink bridge. I'm familiar with the club-3090 repo, but their ready-to-use vLLM recipes…

8
r/LocalLLaMA community 24d ago

I implemented KVarN in my llama.cpp fork and ran KLD benchmarks. It's promising!

Saw this post here yesterday: KVarN: new KV-cache quant from Huawei. 3–5× KV cache compression with actual speed-up instead of slow-down, and unlike TurboQuant it holds up on reasoning (Apache 2.0, vLLM single flag) Cheap KV cache with good precision? Sign me up! Oh, vLLM…

12
llama.cpp releases dev-tools 25d ago

b9521

CUDA: enroll mul_mat_vec_q_moe into pdl (#24087). Enroll mul_mat_vec_q_moe into PDL, boosting MTP performance on BW. Data collected on a B4500. Before (llama.cpp): python mtp-bench.py code_python pred= 192 draft= 150 acc= 116 rate=0.773 tok/s=202.8…

10
