Tag: Inference · 340 articles archived under #inference

The Information — AI news-outlet 17d ago Inside Tech’s Feverish Demand for Retatrutide, a Supposed Super Peptide For more than a decade, Dr. Molly Maloof has had a front-row seat to Silicon Valley’s ever-evolving health obsessions as a physician and founder of M3 Healthspan, a San Francisco–based concierge medical practice serving the tech elite. Lately, those conversations have… 25

Hugging Face Daily Papers research 17d ago From 2D Grids to 1D Tokens: Reforming Shared Representations for Multimodal Image Fusion Abstract A multimodal image fusion approach uses a 1D token interface from a pretrained image tokenizer to enhance global appearance coherence while preserving local details through selective token editing. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Multimodal image fusion… 33

arXiv — NLP / Computation & Language research 18d ago MiniPIC: Flexible Position-Independent Caching in <100LOC arXiv:2606.13126v1 Announce Type: cross Abstract: Retrieval-augmented and agentic workloads repeatedly prefill recurring predictable structured inputs (which we call "spans") such as documents and code files. Yet, prefix caching in engines such as vLLM cannot reuse their KV… 12

arXiv — NLP / Computation & Language research 18d ago Examining the Cognitive Gap Between Authors and Peer Reviewers on Academic Paper Novelty arXiv:2606.13452v1 Announce Type: cross Abstract: Novelty is a crucial metric for assessing the quality of academic papers. Scholars strive to highlight the novel aspects of their work, particularly in the title, abstract, and introduction.
Peer review, serving as the gatekeeper… 18

Hugging Face Daily Papers research 18d ago ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction Abstract ReVision improves computer-use agent efficiency by removing redundant visual patches from consecutive screenshots while preserving spatial structure, reducing token usage by 46% and improving success rates. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Computer-use… 10

Hugging Face Daily Papers research 18d ago Breaking the Bubble: Asynchronous Pipeline Parallel Training with Bounded Weight Inconsistency Abstract PACI enables efficient asynchronous pipeline training by controlling forward/backward weight inconsistency through local gradient accumulation, achieving higher throughput and faster training time-to-accuracy without sacrificing stability or memory usage. Generated by… 9

arXiv — Machine Learning research 19d ago Federated continual learning: A comprehensive survey on lifelong and privacy-preserving learning over distributed and non-stationary data arXiv:2606.11272v1 Announce Type: new Abstract: Federated Learning (FL) enables collaborative and privacy-preserving model training across distributed clients, but most existing FL systems implicitly assume data stationarity. In real-world settings-such as healthcare, industrial… 10

arXiv — Machine Learning research 19d ago LSTM-Based Detection of Structural Breaks in Property Insurance Loss Reserving: A Climate-Informed Approach arXiv:2606.11463v1 Announce Type: new Abstract: Accurate loss reserving is foundational to insurer solvency, yet accelerating climate driven catastrophes systematically violate the stability assumptions on which traditional actuarial methods depend.
This white paper presents a… 30

arXiv — Machine Learning research 19d ago Structure-Preserving Neural Surrogates with Tractable Uncertainty Quantification arXiv:2606.11650v1 Announce Type: new Abstract: Recent advances in scientific machine learning provide a means of near-real-time solution to partial differential equations (PDEs), but lack the theoretical underpinnings of conventional simulators that support contemporary… 4

arXiv — NLP / Computation & Language research 19d ago External Experience Serving in Production LLM Systems: A Deployment-Oriented Study of Quality-Cost Trade-offs arXiv:2606.11806v1 Announce Type: new Abstract: Production LLM systems accumulate reusable operational experience, but the practical deployment issue is not merely whether such experience can help. It is how different serving strategies trade off quality against online cost… 30

Hugging Face Daily Papers research 19d ago Breaking Entropy Bounds: Accelerating RL Training via MTP with Rejection Sampling Abstract Bebop addresses the efficiency bottleneck in reinforcement learning training of large language models by optimizing multi-token prediction techniques through entropy-aware sampling and novel training objectives that improve acceptance rates and inference throughput.… 28

r/LocalLLaMA community 19d ago FlashMemory-DeepSeek-V4: Lightning Index Ultra-Long Context via Lookahead Sparse Attention Conventional LLMs keep the full KV cache loaded during decoding, causing a severe GPU memory bottleneck for ultra-long context serving. In this report, we propose Lookahead Sparse Attention (LSA), a novel inference paradigm powered by a Neural Memory Indexer built upon the… 25

NVIDIA Developer Blog official-blog 19d ago Run DiffusionGemma on NVIDIA for Developer-Ready, High-Throughput Text Generation Developers building real-time AI—such as chat assistants, copilots, and agentic workflows—are often constrained by token-by-token generation speed. This...
6

Hugging Face Daily Papers research 19d ago FadeMem: Distance-Aware Memory Consolidation for Autoregressive Video Diffusion Abstract FadeMem introduces a distance-aware key-value memory consolidation mechanism that organizes historical video data into a temporal hierarchy, improving long-video generation by preserving recent context and long-range anchors under fixed cache constraints. Generated by… 36

r/LocalLLaMA community 19d ago Qwen3.6-MTP-27B on Tesla V100 @ 55 TPS (llama.cpp) — Any way to push this higher without quality loss? Hey everyone, I'm running Qwen3.6-MTP-27B-MTP (Q4_K_M) with llama.cpp server on a Tesla V100, and I'm currently getting around 55 tokens/sec. I'm trying to find out whether there are any configuration changes that could increase throughput further without reducing output… 31

arXiv — Machine Learning research 20d ago SPACE: Source-free Proxy Anchor Concept Erasure for MLLMs arXiv:2606.09868v1 Announce Type: new Abstract: As Multimodal Large Language Models (MLLMs) face growing privacy risks and regulatory constraints, machine unlearning (MU) has emerged as a crucial solution for removing sensitive data while preserving model performance. However,… 28

arXiv — Machine Learning research 20d ago QSplitFL: Capability Aware Deep Q-Learning for Optimal Split Point Selection in Split Federated Learning arXiv:2606.09869v1 Announce Type: new Abstract: Federated Learning (FL) combined with Split Learning (SL) is a privacy preserving paradigm that enables training deep neural networks (DNNs) on resource constrained devices while reducing overall training cost.
However, determining… 22

arXiv — NLP / Computation & Language research 20d ago Trainable Smooth-Rotation Transforms with Learned Channel Scales for LLM Quantization arXiv:2606.09927v1 Announce Type: cross Abstract: Post-training quantization (PTQ) is one of the most practical ways to reduce the serving cost of Large Language Models (LLMs), but activation quantization remains difficult because outlier-dominated channels lead to large… 6

arXiv — Machine Learning research 20d ago Privacy-Preserving Credit Risk Prediction with Alternative Data arXiv:2606.10333v1 Announce Type: new Abstract: Credit risk prediction is a critical problem in the consumer credit industry. Traditionally, financial institutions construct credit risk prediction models using borrowers' demographic, financial, and credit history data,… 14

arXiv — NLP / Computation & Language research 20d ago ConvMemory v2: A Recall-Preserving Top-10 Evidence Reranker for Conversational Memory Retrieval arXiv:2606.10842v1 Announce Type: new Abstract: We describe ConvMemory v2, an opt-in token-evidence reranker that sits after the lightweight ConvMemory v1 reranker and reorders only v1's protected top-10 candidate set. v2 is a fine-tuned ms-marco-MiniLM-L-6-v2 cross-encoder… 31

arXiv — NLP / Computation & Language research 20d ago Does Reasoning Preserve Alignment? On the Trustworthiness of Large Reasoning Models arXiv:2606.11046v1 Announce Type: new Abstract: Instruction-tuned LLMs are increasingly converted into reasoning models through post-training to improve multi-step task performance. This conversion is usually optimized for reasoning accuracy, without explicitly preserving the… 18

NVIDIA Developer Blog official-blog 20d ago Model Quantization: Turn FP8 Checkpoints into High-Performance Inference Engines with NVIDIA TensorRT Converting a quantized checkpoint into an NVIDIA TensorRT engine bridges the gap between model optimization and production deployment, enabling faster...
6

Hugging Face Daily Papers research 20d ago AsyncWebRL: Efficient Multi-Step RL for Visual Web Agents Abstract AsyncWebRL improves vision-language web agent training through asynchronous reinforcement learning and trajectory normalization modifications, achieving faster throughput and better performance on challenging tasks. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Training… 32

r/MachineLearning community 20d ago Are privacy-preserving techniques actually being used in production ML systems? [D] I've been reading more about privacy-preserving ML approaches such as differential privacy, federated learning, and on-device inference. The research literature is fairly active, but I'm curious about real-world adoption. For those working in industry: Are these techniques being… 16

arXiv — Machine Learning research 21d ago Enabling KV Caching of Shared Prefix for Diffusion Language Models arXiv:2606.07571v1 Announce Type: new Abstract: Key-value (KV) caching for shared prefixes is essential for high-throughput large language model (LLM) serving, but it faces critical challenges in emerging diffusion language models (DLMs). In DLMs, bidirectional attention means… 11

arXiv — Machine Learning research 21d ago Semantic Cache Distillation: Efficient State Transfer via Reuse and Selective Patching arXiv:2606.07684v1 Announce Type: new Abstract: Disaggregated serving alleviates memory bottlenecks in Large Language Model (LLM) inference but creates a severe communication bottleneck: transmitting high-dimensional Key-Value (KV) caches often dominates time-to-first-token… 16

arXiv — Machine Learning research 21d ago Breaking the Bubble: Asynchronous Pipeline Parallel Training with Bounded Weight Inconsistency arXiv:2606.07881v1 Announce Type: new Abstract: Pipeline parallelism is essential for training large neural networks, but existing schedules trade off throughput, memory, and optimization consistency.
Synchronous pipelines preserve forward/backward weight consistency but suffer… 5

Hugging Face Daily Papers research 21d ago CoVEBench: Can Video Editing Models Handle Complex Instructions? Abstract A new benchmark called CoVEBench is introduced to evaluate compositional video editing capabilities, addressing limitations of existing models in handling complex, multi-step editing tasks while preserving spatiotemporal content. Generated by… 19

NVIDIA Developer Blog official-blog 21d ago Train Models Faster with JAX and MaxText Using NVFP4 on NVIDIA Blackwell Pre-training frontier LLMs comes down to throughput. When training spans trillions of tokens across thousands of accelerators, every percentage point of step... 34

r/MachineLearning community 21d ago Université Paris Saclay or TU Delft for Applied Mathematics Masters [R] I've been admitted into both UPS and TUD for Applied Mathematics, and I wanted to hear some advice on which one would be better. For context, I'd like to work in some form of AI research, most likely within industry. At the moment, I'm most interested in privacy preserving… 8

Hacker News — AI on Front Page community 21d ago MiMo-v2.5-Pro-UltraSpeed: 1T model with 1000 tokens per second Article URL: https://mimo.xiaomi.com/blog/mimo-tilert-1000tps Comments URL: https://news.ycombinator.com/item?id=48446639 Points: 252 # Comments: 175 30

Hugging Face Daily Papers research 21d ago Physics in 2-Steps: Locking Motion Priors Before Visual Refinement Erases Them Abstract PhaseLock is a training-free framework that improves physical consistency in image-to-video diffusion models by preserving motion priors from early-step inference throughout the denoising process.
Generated by Qwen/Qwen2.5-Coder-32B-Instruct Image-to-Video diffusion… 17

arXiv — Machine Learning research 22d ago Accelerating Reproducible Research in Synthetic EHR Generation arXiv:2606.06990v1 Announce Type: new Abstract: The generation of high-fidelity synthetic Electronic Health Records (EHR) is crucial for advancing medical research while preserving patient privacy. However, head-to-head comparison of existing generative models is hindered by… 13

arXiv — Machine Learning research 22d ago Structure-Preserving Correction Learning for Sparse Bayesian Inference in Brain Source Imaging arXiv:2606.07196v1 Announce Type: new Abstract: Classical sparse Type-II Bayesian methods for M/EEG brain imaging support joint estimation of source and noise hyperparameters, but rely on fixed iterative update rules. Although these updates are principled and interpretable,… 28

arXiv — Machine Learning research 22d ago Closed-Form Spectral Regularization for Multi-Task Model Merging arXiv:2606.07289v1 Announce Type: new Abstract: Model merging combines several independently fine-tuned experts into a single multi-task model without any training data, reducing the storage, serving, and decentralized-development costs of large foundation models.… 38

arXiv — Machine Learning research 22d ago Breaking the Ice: Analyzing Cold Start Latency in vLLM arXiv:2606.07362v1 Announce Type: new Abstract: As scalable inference services become popular, the cold start latency of an inference engine becomes important. Today, vLLM has evolved into the de facto inference engine of choice for many inference workloads. Although popular,… 11

arXiv — Machine Learning research 22d ago Reversible Foundations: Training a 120B Sparse MoE through State-Preserving Scaling arXiv:2606.07404v1 Announce Type: new Abstract: This paper reports on training a hundred-billion-parameter sparse mixture of experts on a single eight-GPU node, end to end.
LightningLM 0.1V is a recurrence-backbone language model family grown in four stages from a small dense… 28

arXiv — NLP / Computation & Language research 22d ago KIT's Submission to Cross-Lingual Voice Cloning in IWSLT 2026 arXiv:2606.07240v1 Announce Type: new Abstract: Cross-lingual voice cloning aims to generate speech in a target language while preserving speaker identity from a source-language reference. This task is central to speech translation and is the focus of the IWSLT 2026… 21

arXiv — NLP / Computation & Language research 22d ago MMAE: A Massive Multitask Audio Editing Benchmark arXiv:2606.07229v1 Announce Type: cross Abstract: We introduce MMAE, a Massive Multitask Audio Editing benchmark, serving as the first comprehensive evaluation testbed designed for general-purpose instruction-based audio editing. Spurred by the shift toward intelligent creation,… 8

arXiv — NLP / Computation & Language research 22d ago DirectAudioEdit: Inversion-Free Text-Guided Audio Editing via Diffusion Prediction Contrast arXiv:2606.07356v1 Announce Type: cross Abstract: Text-guided audio editing aims to modify the language-specified acoustic content while preserving edit-irrelevant source components. Existing training-free methods typically rely on inversion-based editing. While inversion-free… 25

Hugging Face Daily Papers research 22d ago SubtleMemory: A Benchmark for Fine-Grained Relational Memory Discrimination in Long-Horizon AI Agents Abstract SubtleMemory benchmark evaluates AI agents' ability to handle complex relational memory structures that emerge during prolonged interactions, revealing limitations in current memory systems for preserving and utilizing nuanced memory relationships. Generated by… 33

r/LocalLLaMA community 23d ago dvlt.cu: inference engine written from scratch in CUDA/C++ for NVIDIA's DVLT 3D transformer model Im into both HPC and 3D reconstruction, so I built this as a side project.
dvlt.cu is a single 5MB binary: - No python, torch, TF, ONNX, llama.cpp, vLLM, or huggingface runtime - Nearly no dependencies: only cuBLASLt (shipped with libcuda) + cuTLASS (header only lib) - mmap'd… 21

r/LocalLLaMA community 23d ago 120 tok/s on 12GB VRAM with Gemma 4 12B QAT MTP Google just released the QAT (Quantization-Aware Training) variant of their Gemma 4 models, including 12B, so it was only natural for me to benchmark it on my 12GB GPU since it fits entirely in VRAM. I was pleasantly surprised with the result! By using llama.cpp patched with the… 17

r/LocalLLaMA community 23d ago It felt good to return my Asus Spark It's an incredible little package but too expensive of a price to pay for the performance and I simply didn't want to be part of the great "Superchip lie" - it could be super, but its super ruined by its limited memory bandwidth even though it *could* be 2x throughput - it… 31

r/LocalLLaMA community 23d ago Serving TTS/cloning models on llama.cpp? Are there any quality voice cloning and speech generation models that already have support in Llama.cpp or, more likely, vLLM-Omni? It would be nice to swap them out like any other inference model and use a common API, rather making a separate container or conda for each model I… 17

r/LocalLLaMA community 23d ago Domino: Decoupling Causal Modeling from Autoregressive Drafting in Speculative Decoding Up to 5.8x throughput speedup on Qwen3 Paper: https://arxiv.org/abs/2605.29707 Code: https://github.com/jianuo-huang/Domino Models: https://huggingface.co/Huang2020 submitted by /u/pmttyji 6

Hugging Face Daily Papers research 24d ago LLM Anonymization Against Agentic Re-Identification Abstract AURA is an LLM-powered anonymization framework that balances privacy protection against agentic web-search re-identification while preserving contextual utility through adaptive privacy scopes and mask-reconstruct methods.
Generated by Qwen/Qwen2.5-Coder-32B-Instruct… 25

r/LocalLLaMA community 24d ago Qwen 3.6-27B on vLLM with dual RTX 3090s: looking for launch parameters Hi everyone. Please share your working launch commands for running Qwen 3.6-27B via vLLM on dual RTX 3090s (both running in PCIe 4.0 x8). I'm interested in setups both with and without an NVLink bridge. I'm familiar with the club-3090 repo, but their ready-to-use vLLM recipes… 8

r/LocalLLaMA community 24d ago I implemented KVarN in my llama.cpp fork and ran KLD benchmarks. It's promising! Saw this post here yesterday: KVarN: new KV-cache quant from Huawei. 3–5× KV cache compression with actual speed-up instead of slow-down, and unlike TurboQuant it holds up on reasoning (Apache 2.0, vLLM single flag) Cheap KV cache with good precision? Sign me up! Oh, vLLM… 12

llama.cpp releases dev-tools 25d ago b9521 CUDA: enroll mul_mat_vec_q_moe into pdl (#24087) Enroll mul_mat_vec_q_moe into PDL, boosting MTP performance on BW Data collected on a B4500: Before (llama.cpp) ➜ llama.cpp git:(master) ✗ python mtp-bench.py code_python pred= 192 draft= 150 acc= 116 rate=0.773 tok/s=202.8… 10

Page 3 of 7 · 340 articles