Tag: Inference · 339 articles archived under #inference
arXiv — Machine Learning research 1h ago Learning to Distributedly Estimate under Partially Known Dynamics: A Covariance-Agnostic Neural Kalman Consensus Filter arXiv:2606.28441v1 Announce Type: new Abstract: Online latent state estimation constitutes a fundamental challenge within the artificial intelligence field, serving as a foundational tool for diverse applications, including sequential decision making, anomaly and change-point… 21 arXiv — Machine Learning research 1h ago DiLaServe: High SLO Attainment Serving for Diffusion Language Models arXiv:2606.29094v1 Announce Type: new Abstract: Diffusion language models (DLMs) have recently emerged as a promising alternative to conventional autoregressive language models. By generating multiple tokens in parallel during each denoising step, they offer higher inference… 36 arXiv — Machine Learning research 1h ago Prototype Latent World Model Replay for Class-Incremental Learning arXiv:2606.29465v1 Announce Type: new Abstract: Class-incremental learning requires a model to learn new classes while preserving decision regions for old ones. This is difficult when raw old samples are no longer available. We propose Prototype Latent World Model Replay, a… 8 arXiv — NLP / Computation & Language research 1h ago Structure-Preserving Document Translation via Multi-Stage LLM Pipeline: A Case Study in Marathi arXiv:2606.28796v1 Announce Type: new Abstract: Government documents in India are predominantly issued in regional languages such as Marathi, creating substantial accessibility barriers for non-native readers, interstate administrative bodies, and policy analysts. 
Although… 30 arXiv — NLP / Computation & Language research 1h ago A Comparative Study on Affective Cues in Text Embeddings Across Psychological Emotion Theories arXiv:2606.29068v1 Announce Type: new Abstract: Text encoders are known for their utility in natural language processing, as they are able to efficiently compress inputs into dense vectors while preserving semantics. These models have been applied to affective computing, in… 19 r/MachineLearning community 17h ago Cerebras OpenAI deal capacity has effectively killed the waitlist for everyone else [D] I’m pretty annoyed. We’re a small AI startup building a real-time coding agent. Our p95 latency requirements are tight (and self imposed, but thats the product). We need sustained high-throughput inference with ~1-2k tokens/second. Been on the Cerebras waitlist for months trying… 25 arXiv — Machine Learning research 1d ago NormGuard: Reward-Preserving Norm Constraints in Flow-Matching Reinforcement Learning arXiv:2606.27771v1 Announce Type: new Abstract: Reinforcement learning (RL) post-training improves the reward alignment of flow-based generators, but often degrades perceptual quality in ways that are not captured by the reward proxy. We identify a simple structural signature of… 8 arXiv — Machine Learning research 1d ago Benchmarking on Tasks That Matter: Dataset Selection for Preserving Model Rankings arXiv:2606.27997v1 Announce Type: new Abstract: Benchmarks of machine learning models often include many datasets, making evaluation expensive. For efficiency, it is preferable to perform evaluations on small, representative datasets instead. 
The selection of such subsets… 21 arXiv — Machine Learning research 1d ago Odyssey: Constructing Verifiable Local Truth-Preserving Foundation Models arXiv:2606.27593v1 Announce Type: cross Abstract: We introduce a categorical framework called ODYSSEY for constructing verifiable, local truth-preserving foundation models as compositions of foundries: building-block architectural components that specify a cover of local… 27 arXiv — NLP / Computation & Language research 1d ago Cluster, Route, Escalate: Cascaded Framework for Cost-Aware LLM Serving arXiv:2606.27457v1 Announce Type: cross Abstract: Efficient deployment of large language models (LLMs) in production forces a trade-off between accuracy and cost. Operators often default to a single model that is either expensive for easy queries or insufficient for hard ones.… 20 arXiv — NLP / Computation & Language research 1d ago RateQuant: Optimal Mixed-Precision KV Cache Quantization via Rate-Distortion Theory arXiv:2605.06675v2 Announce Type: replace-cross Abstract: Large language models cache all previously computed key-value (KV) pairs during generation, and this KV cache grows linearly with sequence length, making it a primary memory bottleneck for serving. Quantizing the KV cache… 5 r/LocalLLaMA community 1d ago High-quality GLM-5.2 Quant on 4x DGX Spark - Guide, Results, and Comps I got GLM-5.2 NVFP4 running on four DGX Sparks at 128K context. This is still a niche/hacky setup, but it is now a real serving point rather than just a proof of life. Objective : A high quality 4-bit quant running on 4x spark. Model: https://huggingface.co/Mapika/GLM-5.2-NVFP4… 9 r/LocalLLaMA community 1d ago Ornith-1.0-35B GGUF update: native MTP speculative-decode graft + full serving/TTFT/long-context numbers (llama.cpp, tp=1) Follow-up to my previous Ornith-1.0-35B Q3_K_M post. 
I grafted a native MTP draft head onto the IQ4_XS body (head at Q6) for self-speculative decode, single GPU, llama.cpp: 1.3-1.35x single-stream decode (172.6 -> 233.8 tok/s). Next-token distribution is byte-identical to… 11 Hacker News — AI on Front Page community 2d ago AMD Strix Halo RDMA Cluster Setup Guide Article URL: https://github.com/kyuz0/amd-strix-halo-vllm-toolboxes/blob/main/rdma_cluster/setup_guide.md Comments URL: https://news.ycombinator.com/item?id=48703258 Points: 207 # Comments: 61 22 arXiv — Machine Learning research 4d ago PersistentKV: Page-Aware Decode Scheduling for Long-Context LLM Serving on Commodity GPUs arXiv:2606.26666v1 Announce Type: new Abstract: Autoregressive large language model (LLM) serving is increasingly limited by key-value (KV) cache movement rather than dense matrix multiplication. Modern paged-attention systems reduce KV-cache fragmentation and mature kernels… 20 arXiv — Machine Learning research 4d ago Quantization in Federated Learning: Methods, Challenges and Future Directions arXiv:2606.26822v1 Announce Type: new Abstract: Federated Learning (FL) has become a foundational paradigm for privacy-preserving distributed intelligence, yet its scalability remains fundamentally constrained by communication bottlenecks, device heterogeneity, and the… 20 arXiv — NLP / Computation & Language research 4d ago AnySimLite: A Lightweight Few-Shot Similarity Encoder for On-Device Speech-Adjacent Classification arXiv:2606.26452v1 Announce Type: new Abstract: To minimize privacy concerns and inference latency on edge devices like smartphones, lightweight on-device models remain important for end-user applications. 
Many of these applications involve natural language classification, but… 31 arXiv — NLP / Computation & Language research 4d ago Mapping Political-Elite Networks in Europe with a Multilingual Joint Entity-Relation Extraction Pipeline arXiv:2606.27347v1 Announce Type: new Abstract: Whether political elites organise into rent-seeking coalitions that capture public resources or civic networks that sustain governance is a central question in comparative politics. Yet observing these complex, informal, and… 11 Hugging Face official-blog 4d ago Run a vLLM Server on HF Jobs in One Command Published June 26, 2026, by Quentin Gallouédec (qgallouedec). You can spin up a private, OpenAI-compatible LLM endpoint on Hugging Face infrastructure with a single command — no servers… 18 r/LocalLLaMA community 4d ago LFM2.5 230M running in-browser at 1,400 tok/s using custom WebGPU kernels Everything runs locally in your browser using custom WebGPU kernels written by Fable 5 (before it was shut down) and Opus 4.8. The video was recorded on my M4 Max. Model: LiquidAI/LFM2.5-230M (GGUF) Demo: https://huggingface.co/spaces/webml-community/lfm2-webgpu-kernels … 37 NVIDIA Developer Blog official-blog 4d ago Scaling AI Inference Across Multiple GPUs Using NVIDIA TensorRT with Multi-Device Inference Support Generative AI workloads are rapidly outgrowing the memory and compute budget of single GPUs. For inference developers building media generation pipelines, the... 38 Vercel — AI dev-tools 4d ago AI SDK 7 is now available AI SDK 7 is a major release for building production agents in TypeScript. The SDK has grown from model calls and chat primitives into a broader agent platform for developing, running, integrating, and observing agents across text, audio, realtime, image, and video. 
Every major… 8 Smol AI News news-outlet 4d ago not much happened today **Z.ai's GLM-5.2** leads in coding and agent benchmarks with top scores like **1595** on Code Arena: Frontend and **34.29%** reasoning accuracy with zero failures. Databricks improved GLM-5.2 speed to **392 tok/s** using hardware and optimizations. **Ornith-1.0**, a new… 13 arXiv — Machine Learning research 5d ago TL++: Accuracy and Privacy Preserving Traversal Learning for Distributed Intelligent Systems arXiv:2606.25627v1 Announce Type: new Abstract: Distributed intelligent systems increasingly need to train across data silos without centralizing raw data. Federated learning keeps data local but can suffer under heterogeneous partitions and requires repeated full-model… 22 arXiv — NLP / Computation & Language research 5d ago Dustin: Draft-Augmented Sparse Verification for Efficient Long-Context Generation with Speculative Decoding arXiv:2606.24957v1 Announce Type: new Abstract: While speculative decoding improves inference throughput for multi-batch long-context Large Language Models (LLMs), its efficiency is often limited by a verification bottleneck where Key-Value (KV) cache loading dominates latency.… 19 Hugging Face Daily Papers research 5d ago RoPE-Aware Bit Allocation for KV-Cache Quantization Abstract Block-GTQ introduces a RoPE-aware bit allocation method for key-cache quantization that improves attention accuracy and downstream performance through adaptive bit distribution and packed cache serving. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Existing low-bit… 22 r/LocalLLaMA community 5d ago Got GLM-5.2 + MTP speculative decode running on 4× DGX Spark (GB10) — and the build piece the public recipe is missing TL;DR: the recipe's image-build mods aren't actually public – I reconstructed them from the public kernels (with Claude) – and you have to build vLLM at the author's exact pinned ref or the real AWQ weights crash on load. Running now at ~9.4 tok/s on my own 4× GB10. 
Saw a link… 20 r/LocalLLaMA community 5d ago Has anyone else found vLLM outputs noticeably worse than llama.cpp for the same model? I'm wondering if anyone else has come across this. I've tested the same model on llama.cpp and vLLM with similar settings and quantizations. The performance and concurrency in vLLM are noticeably better, but sometimes the model feels less reliable. Some things I've noticed:… 27 r/LocalLLaMA community 5d ago I did some model hacks, and got GLM5.2 from about 2.5 tok/s to >50 tok/s on my GH200 system. G'day. This is part 3 of my Local LLM adventures. I have a crazy hacked server-to-desktop system (GPUs: 2x Hopper H100, 96 GB HBM3 each; CPUs: 2x Grace, 72 cores each; host memory: 480 GB LPDDR5X per Grace, 960 GB total). So I can technically run GLM5.2.… 34 r/LocalLLaMA community 5d ago Qwen3.6 27B more dumb in vLLM compared to llama.cpp Hello, I recently bought a new RTX 5060Ti to pair with the RTX 5060Ti I already own, now I have 32GB of VRAM. Up until now for convenience I've used llama.cpp, for goodness' sake it works excellently when only 1 user is using it, but now there are 2 of us using it and llama.cpp… 34 r/LocalLLaMA community 5d ago Unlimited-OCR is now on ModelScope! A 3.3B multilingual OCR model for one-shot parsing across single images, multi-page documents, and PDFs. License: MIT. Full-document parsing instead of cropped-region OCR; 32K output length for long OCR sequences; Base and gundam image modes for different document layouts; Transformers inference + SGLang serving with OpenAI-compatible streaming requests; Built to push DeepSeek-OCR-style document… 22 arXiv — Machine Learning research 6d ago ARIA: Adaptive Region-Based Importance Allocation for Conditional Diffusion Distillation arXiv:2606.23898v1 Announce Type: new Abstract: Distilling conditional diffusion models aims to transfer the behavior of a large teacher to a smaller student while preserving alignment across conditioning inputs. 
Unlike recognition tasks, knowledge distillation in conditional… 14 arXiv — Machine Learning research 6d ago Offline Reinforcement Learning for Warehouse SLAM Throughput Control arXiv:2606.23978v1 Announce Type: new Abstract: We present an offline reinforcement learning (RL) framework for optimizing SLAM throughput control in a warehouse fulfillment environment. SLAM (Scan/Label/Apply/Manifest) throughput directly influences system congestion and… 18 arXiv — Machine Learning research 6d ago Learning to Trigger: Reinforcement Learning at the Large Hadron Collider arXiv:2606.23993v1 Announce Type: new Abstract: High-throughput scientific facilities such as the Large Hadron Collider depend on real-time event filtering (triggering) under tight constraints on bandwidth, latency, and storage. In practice, trigger menus are largely… 24 arXiv — Machine Learning research 6d ago EnerInfer: Energy-Aware On-Device LLM Inference arXiv:2606.23001v1 Announce Type: cross Abstract: On-device LLM inference is increasingly attractive for privacy-preserving, reliable, and cost-effective deployment, yet its energy and thermal costs remain a critical bottleneck. Existing systems primarily optimize for decoding… 13 arXiv — NLP / Computation & Language research 6d ago A Pāninian Foundation for Indic Language Processing arXiv:2606.24172v1 Announce Type: new Abstract: More than a billion people communicate in Indic languages, yet the natural language processing infrastructure serving them remains fragmented and underdeveloped. The cause is structural: the field organizes its tools and benchmarks… 24 arXiv — Machine Learning research 6d ago CrossPool: Efficient Multi-LLM Serving for Cold MoE Models through KV-Cache and Weight Disaggregation arXiv:2606.24506v1 Announce Type: cross Abstract: Emerging LLM services increasingly host many sparse MoE models, yet most models receive sparse requests and remain cold. 
This creates a GPU memory problem: model weights are stable and model-determined, while KV-cache is… 8 arXiv — NLP / Computation & Language research 6d ago Qwen-AgentWorld: Language World Models for General Agents arXiv:2606.24597v1 Announce Type: new Abstract: A world model predicts environment dynamics based on current observations and actions, serving as a core cognitive mechanism for reasoning and planning. In this work, we investigate how world modeling based on language models can… 8 arXiv — NLP / Computation & Language research 6d ago Privacy-Preserving RAG via Multi-Agent Semantic Rewriting: Achieving Confidentiality Without Compromising Contextual Fidelity arXiv:2606.24623v1 Announce Type: new Abstract: Retrieval-Augmented Generation enhances large language models by incorporating external knowledge, but deploying it in sensitive scenarios risks privacy leakage via malicious prompts. To address this, we propose a multi-agent… 30 arXiv — NLP / Computation & Language research 6d ago ComputeFHE: A Privacy-Preserving General-Purpose Computation Library arXiv:2606.24379v1 Announce Type: cross Abstract: Fully Homomorphic Encryption (FHE) enables computations to be performed directly on encrypted data while preserving data confidentiality. However, its practical applications remain limited by high computational costs and… 6 Vercel — AI dev-tools 6d ago GLM 5.2 Fast via Wafer now available on AI Gateway GLM 5.2 Fast via Wafer is now available on AI Gateway . 
Based on our own benchmarking across small-context, large-context, and tool-call scenarios, Wafer delivers a 2x higher throughput than other providers serving GLM-5.2 on serverless, leading on decode and end-to-end speed… 7 Hugging Face Daily Papers research 6d ago Vera: A Layered Diffusion Model for Content-Preserving Video Editing Abstract Vera is a layered diffusion framework that preserves video content during editing by generating edit layers and alpha mattes through a Mixture-of-Transformers architecture. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Video diffusion models have enabled remarkable… 10 r/MachineLearning community 6d ago What's your biggest pain point when choosing between cloud GPU providers for LLM inference?[R] Trying to understand how other people make this decision. Do you compare $/hr, $/token, throughput, reliability? Is there a tool or resource you rely on, or are you just doing the math manually? Asking because I'm an ML engineer who's been doing this in spreadsheets and… 14 r/LocalLLaMA community 7d ago New ablation operator. (apostate) Today I added a new operator to apostate. This new operator is a contrastive co-vector edit E = I − R Dᵀ . Removing the refusal direction outright disturbs benign behavior, while naively preserving all harmless variance along it leaves the refusal that is entangled with general… 34 r/LocalLLaMA community 8d ago A100 slow Qwen3.6-27B-FP8 Setting up a server for someone who has an A100 80GB, even though this doesn't natively support FP8 does 43tps decode sound too low for single request? For comparison the exact same vllm config on my RTX 6000 PRO runs the same single request test at 130tps. For 8 concurrent… 11 r/LocalLLaMA community 8d ago Qwen 27B for planning, Qwen 35B-A3B for execution? My 32GB unified memory setup runs both, though 27B even with MTP is something like 7-10 tok/sec. Usable but not real time by any means. 
(~18 tok/sec with 35B-A3B) Would it be worth using 27B to plan long horizon tasks, put together the PLAN.md, and have 35B-A4B iterate over it… 14 r/LocalLLaMA community 8d ago ROCm vs Vulkan vs vLLM on Dual R9700's Just wanted to share these numbers I saw running Qwen3.6 35BA3 and Qwen3.6 27B and the big increase I saw going to vLLM. I was just expecting better concurrency but ended up with a lot better speeds. llama.cpp services Running ROCm and Vulkan Model Backend Gen 35B-A3B Q6_K_XL… 19 r/LocalLLaMA community 8d ago R9700 abysmal performance, getting desparate I've been trying to get my 2x R9700 setup to work for the past two weeks. This has been such a time sink I wish I had just gone with nvidia. At this point I'm close to selling the cards. I need vLLM. This is a dedicated setup for multi-user serving. I've tried the… 17 r/LocalLLaMA community 9d ago I wrote a free 15-part series on LLM internals — real math, real tensor shapes, real hardware constraints. All grounded in Gemma 4 12B's actual config. If you run open-source models and want to understand what's actually happening under the hood — I spent the last few months writing a 15-part series that covers the full stack from tokenization to production serving. Most articles are grounded in Gemma 4 12B as the running… 19 r/MachineLearning community 9d ago An open handbook on LLM inference at scale (GPU internals, KV cache, batching, vLLM/SGLang/TensorRT-LLM) [P] I've been working through the internals of LLM inference and writing up what I learn as an open, in-progress handbook. Just wrapped another chapter on GPU execution and memory internals: why a GPU sits mostly idle during inference, how the memory hierarchy gates throughput, and… 13