Tag: Inference
40 articles archived under #inference

r/LocalLLaMA · community · 5h ago · 17
qwen3.6 just stops
https://preview.redd.it/74cj1xu9pw0h1.png?width=1229&format=png&auto=webp&s=3ae999cc3530ecb4eccf70e25f1a9eb2aa3f2d7b
Sometimes Qwen 3.6 just stops in the middle of a task; is there a way to avoid it? This is the qwen-code CLI, but it also happens on opencode. Running with vLLM with…

arXiv — Machine Learning · research · 15h ago · 22
Rotation-Preserving Supervised Fine-Tuning (arXiv:2605.10973v1, announce type: new)
Abstract: Supervised fine-tuning (SFT) improves in-domain performance but can degrade out-of-domain (OOD) generalization. Prior work suggests that this degradation is related to changes in dominant singular subspaces of pretrained weight…

arXiv — Machine Learning · research · 15h ago · 17
Behavioral Mode Discovery for Fine-tuning Multimodal Generative Policies (arXiv:2605.11387v1, announce type: new)
Abstract: We address the problem of fine-tuning pre-trained generative policies with reinforcement learning (RL) while preserving the multimodality of their action distributions. Existing methods for RL fine-tuning of generative policies…

arXiv — NLP / Computation & Language · research · 15h ago · 27
ReAD: Reinforcement-Guided Capability Distillation for Large Language Models (arXiv:2605.11290v1, announce type: new)
Abstract: Capability distillation applies knowledge distillation to selected model capabilities, aiming to compress a large language model (LLM) into a smaller one while preserving the abilities needed for a downstream task. However, most…

arXiv — NLP / Computation & Language · research · 15h ago · 33
SOMA: Efficient Multi-turn LLM Serving via Small Language Model (arXiv:2605.11317v1, announce type: new)
Abstract: Large Language Models (LLMs) are increasingly deployed in multi-turn dialogue settings where preserving conversational context across turns is essential. A standard serving practice concatenates the full dialogue history at every…

arXiv — NLP / Computation & Language · research · 15h ago · 8
PRISM: Pareto-Efficient Retrieval over Intent-Aware Structured Memory for Long-Horizon Agents (arXiv:2605.12260v1, announce type: new)
Abstract: Long-horizon language agents accumulate conversation history far faster than any fixed context window can hold, making memory management critical to both answer accuracy and serving cost. Existing approaches either expand the…

arXiv — NLP / Computation & Language · research · 15h ago · 24
ORBIT: Preserving Foundational Language Capabilities in GenRetrieval via Origin-Regulated Merging (arXiv:2605.12419v1, announce type: new)
Abstract: Despite the rapid advancements in large language model (LLM) development, fine-tuning them for specific tasks often results in the catastrophic forgetting of their general, language-based reasoning abilities. This work investigates…

arXiv — NLP / Computation & Language · research · 15h ago · 38
fg-expo: Frontier-guided exploration-prioritized policy optimization via adaptive kl and gaussian curriculum (arXiv:2605.11403v1, announce type: cross)
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has become the standard paradigm for LLM mathematical reasoning, with Group Relative Policy Optimization (GRPO) serving as the dominant algorithm. We identify two overlooked…

r/LocalLLaMA · community · 21h ago · 4
Is using vLLM actually worth it if you aren't serving the model to other people?
So, as most of us here are, I'm a llama.cpp loyalist. Easy to understand, great configuration, relatively stable, etc. But I've been increasingly tempted by vLLM, especially since AMD just added it as a built-in inference engine to Lemonade, and I happen to have an AMD GPU. The…
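Editor's note on the entry above: vLLM is also usable as a plain offline Python library for single-user batch generation, not only as a multi-tenant server. The sketch below uses vLLM's public offline-inference API; the model name, prompts, and sampling settings are illustrative placeholders, not anything taken from the thread.

    # Minimal single-user (offline) vLLM sketch. Model name and sampling values
    # are placeholders, not recommendations from the Reddit post.
    from vllm import LLM, SamplingParams

    llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")        # any locally available HF-format model
    params = SamplingParams(temperature=0.7, max_tokens=256)

    # Batched generation is where vLLM tends to pay off even without a server.
    prompts = [
        "Summarize the tradeoffs between llama.cpp and vLLM in two sentences.",
        "Explain what continuous batching means in LLM serving.",
    ]
    for out in llm.generate(prompts, params):
        print(out.outputs[0].text)

Whether this beats llama.cpp for a single user depends mostly on whether the workload batches well and the GPU is supported by vLLM's backends.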
NVIDIA Developer Blog · official-blog · 1d ago · 17
How to Eliminate Pipeline Friction in AI Model Serving
The path from a trained AI model to production should be smooth, but rarely is. Many teams invest weeks fine-tuning models, only to discover that exporting to a...

r/LocalLLaMA · community · 1d ago · 4
Needle: We Distilled Gemini Tool Calling Into a 26M Model
We open-sourced Needle, a 26M-parameter function-calling (tool-use) model. It runs at 6000 tok/s prefill and 1200 tok/s decode on consumer devices. We were always frustrated by how little effort goes into agentic models that run on budget phones, so we conducted…

r/LocalLLaMA · community · 1d ago · 34
New Qwen3.6 27b Autoround Quant (int4) Best Recipe
I've been using the int4 AutoRound quant from "Lorbus/Qwen3.6-27B-int4-AutoRound" and it has been pretty good! Great quality and performance on an RTX 5090 with vLLM. I decided to use a similar AutoRound recipe but with the "autorund-best" preset instead, which uses more iterations to… (A hedged AutoRound-style sketch follows this group of entries.)

r/LocalLLaMA · community · 1d ago · 17
Gemma 4 MTP vs DFlash on 1x H100: dense vs MoE results
Benchmarked Gemma 4 MTP and z-lab's DFlash on a single H100 80GB using vLLM and NVIDIA's SPEED-Bench qualitative dataset. Setup: hardware: 1x H100 80GB; runtime: vLLM; dataset: SPEED-Bench qualitative; prompts: 880 total, 80 prompts across each of 11 categories; models:…

vLLM releases · dev-tools · 3d ago · 11
v0.20.2
vLLM v0.20.2 highlights: this release features 6 commits from 6 contributors (0 new). A small patch release with bug fixes for DeepSeek V4, gpt-oss, and Qwen3-VL. Bug fixes: DeepSeek V4 sparse attention: re-enable the persistent topk path on Hopper and ensure the memset…

vLLM releases · dev-tools · 9d ago · 37
v0.20.1
vLLM v0.20.1 is a patch release on top of v0.20.0, primarily focused on DeepSeek V4 stabilization and performance improvements, along with several important bug fixes. DeepSeek V4: base model support (#41006), multi-stream pre-attention GEMM (#41061), configurable…

NVIDIA Developer Blog · official-blog · 13d ago · 17
Speed Up Unreal Engine NNE Inference with NVIDIA TensorRT for RTX Runtime
Neural network techniques are increasingly used in computer graphics to boost image quality, improve performance, and streamline content creation. Approaches...

MIT News — AI · research · 14d ago · 13
Enabling privacy-preserving AI training on everyday devices
A new method could bring more accurate and efficient AI models to high-stakes applications like health care and finance, even in under-resourced settings.

Smol AI News · news-outlet · 15d ago · 9
not much happened today
**vLLM v0.20.0** introduces significant improvements in memory and MoE serving efficiency, including **TurboQuant 2-bit KV cache** for **4× KV capacity** and a **2.1% latency improvement**. The update supports multiple hardware platforms like **DeepSeek V4 MegaMoE on…
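As referenced from the AutoRound quant entry above: that post is describing a recipe for Intel's auto-round library. The sketch below shows what such a recipe roughly looks like with auto-round's Python API; the model name, group size, and iteration count are placeholders, and it does not reproduce the post's preset.

    # Hedged sketch of an AutoRound-style int4 weight-only quantization recipe
    # (Intel auto-round). Model name, group size, and iteration count are
    # illustrative placeholders, not the exact recipe from the Reddit post.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from auto_round import AutoRound

    model_name = "Qwen/Qwen2.5-7B-Instruct"
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # More tuning iterations generally trades quantization time for quality.
    ar = AutoRound(model, tokenizer, bits=4, group_size=128, sym=True,
                   iters=1000, nsamples=512)
    ar.quantize()
    ar.save_quantized("./qwen-int4-autoround", format="auto_round")

The saved checkpoint can then be loaded by Transformers or vLLM, which is presumably how the post is running it on an RTX 5090.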
vLLM releases · dev-tools · 15d ago · 33
v0.20.0
vLLM v0.20.0 highlights: this release features 752 commits from 320 contributors (123 new). DeepSeek V4: initial DeepSeek V4 support landed (#40860), with a DSML token-leakage fix in DSV4/3.2 (#40806), a DSA + MTP IMA fix (#40772), and a silu clamp limit on the shared expert (…

NVIDIA Developer Blog · official-blog · 22d ago · 31
Run High-Throughput Reinforcement Learning Training with End-to-End FP8 Precision
As LLMs transition from simple text generation to complex reasoning, reinforcement learning (RL) plays a central role. Algorithms like Group Relative Policy... (A minimal sketch of the group-relative baseline follows this group of entries.)

NVIDIA Developer Blog · official-blog · 1mo ago · 17
Accelerating Vision AI Pipelines with Batch Mode VC-6 and NVIDIA Nsight
In vision AI systems, model throughput continues to improve. The surrounding pipeline stages must keep pace, including decode, preprocessing, and GPU...

NVIDIA Developer Blog · official-blog · 1mo ago · 14
NVIDIA Platform Delivers Lowest Token Cost Enabled by Extreme Co-Design
Co-designed hardware, software, and models are key to delivering the highest AI factory throughput and lowest token cost. Measuring this goes far beyond peak...

MIT News — AI · research · 1mo ago · 29
AI system learns to keep warehouse robot traffic running smoothly
This new approach adapts to decide which robots should get the right of way at every moment, avoiding congestion and increasing throughput.

NVIDIA Developer Blog · official-blog · 1mo ago · 38
Maximize AI Infrastructure Throughput by Consolidating Underutilized GPU Workloads
In production Kubernetes environments, the difference between model requirements and GPU size creates inefficiencies. Lightweight automatic speech recognition...

NVIDIA Developer Blog · official-blog · 1mo ago · 15
Deploying Disaggregated LLM Inference Workloads on Kubernetes
As large language model (LLM) inference workloads grow in complexity, a single monolithic serving process starts to hit its limits. Prefill and decode stages...

Hugging Face · official-blog · 1mo ago · 6
Holotron-12B - High Throughput Computer Use Agent
Team article published March 17, 2026 by Pierre-Louis Cedoz, Hamza Benchekroun, Aurélien Lac, delfosse, Tony Wu… (H Company).

Smol AI News · news-outlet · 1mo ago · 26
not much happened today
**Moonshot's Attention Residuals** paper introduced an input-dependent attention mechanism over prior layers with a **1.25x compute advantage** and less than **2% inference latency overhead**, validated on **Kimi Linear 48B total / 3B active**. The paper sparked debate on…

Smol AI News · news-outlet · 2mo ago · 10
not much happened today
**NVIDIA's Nemotron 3 Super** is a **120B parameter / ~12B active** open model featuring a **hybrid Mamba-Transformer / SSM Latent MoE** architecture and a **1M context window**, delivering up to **2.2x faster inference than GPT-OSS-120B** in FP4 with strong throughput gains. It…

NVIDIA Developer Blog · official-blog · 2mo ago · 37
Removing the Guesswork from Disaggregated Serving
Deploying and optimizing large language models (LLMs) for high-performance, cost-effective serving can be an overwhelming engineering problem. The ideal...

MIT News — AI · research · 2mo ago · 13
New method could increase LLM training efficiency
By leveraging idle computing time, researchers can double the speed of model training while preserving accuracy.
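As referenced from the FP8 RL entry above (and from the fg-expo abstract earlier on this page), Group Relative Policy Optimization replaces a learned value function with a per-prompt normalization of sampled-completion rewards. The sketch below shows only that normalization step; the reward values are made-up placeholders, not data from either article.

    # Minimal sketch of GRPO-style group-relative advantages: each sampled
    # completion for a prompt is scored, then normalized against its own group.
    # Rewards below are made-up placeholders.
    import numpy as np

    def group_relative_advantages(rewards, eps=1e-6):
        r = np.asarray(rewards, dtype=np.float64)
        return (r - r.mean()) / (r.std() + eps)

    # Four completions for one prompt, scored by a verifiable reward (e.g. exact match):
    print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))  # approx [ 1., -1.,  1., -1.]

These advantages then weight the token-level policy-gradient loss in place of a critic's value estimates.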
NVIDIA Developer Blog · official-blog · 2mo ago · 25
Using NVFP4 Low-Precision Model Training for Higher Throughput Without Losing Accuracy
As the sizes of AI models and datasets continue to increase, relying only on higher-precision BF16 training is no longer sufficient. Key challenges such as...

NVIDIA Developer Blog · official-blog · 2mo ago · 30
Unlock Massive Token Throughput with GPU Fractioning in NVIDIA Run:ai
As AI workloads scale, achieving high throughput, efficient resource usage, and predictable latency becomes essential. NVIDIA Run:ai addresses these challenges...

Smol AI News · news-outlet · 3mo ago · 18
Z.ai GLM-5: New SOTA Open Weights LLM
**Zhipu AI** launched **GLM-5**, an **Opus-class** model scaling from **355B to 744B parameters** with **DeepSeek Sparse Attention** integration for cost-efficient long-context serving. GLM-5 achieves **SOTA on BrowseComp** and leads on **Vending Bench 2**, focusing on office…

NVIDIA Developer Blog · official-blog · 3mo ago · 31
Automating Inference Optimizations with NVIDIA TensorRT LLM AutoDeploy
NVIDIA TensorRT LLM enables developers to build high-performance inference engines for large language models (LLMs), but deploying a new architecture...

Smol AI News · news-outlet · 3mo ago · 23
ElevenLabs $500m Series D at $11B, Cerebras $1B Series H at $23B, Vibe Coding -> Agentic Engineering
**Google's Gemini 3** is being integrated widely, including a new **Chrome side panel** and **Nano Banana** UX features, with rapid adoption and a **78% unit-cost reduction** in serving costs. The **Gemini app** reached **750M+ MAU** in Q4 2025, nearing ChatGPT's user base.…

Smol AI News · news-outlet · 3mo ago · 28
Context Graphs: Hype or actually Trillion-dollar opportunity?
**Zhipu AI** launched **GLM-OCR**, a lightweight **0.9B** multimodal OCR model excelling in complex document understanding with top benchmark scores and day-0 deployment support from **lmsys**, **vllm**, and **novita labs**. **Ollama** enabled local-first usage with easy offline…

Smol AI News · news-outlet · 3mo ago · 4
Open Responses: explicit spec for OpenAI's Responses API supported by OpenRouter, Ollama, Huggingface, vLLM, et al
**OpenAI** launched the **Open Responses** API spec, an open-source, multi-provider standard for interoperable LLM APIs designed to simplify agent stacks and tooling. Early adopters like **ollama** and **vLLM** support the spec, while notable absences include **anthropic** and… (A hedged client-side sketch closes out this page.)

Smol AI News · news-outlet · 4mo ago · 30
Meta Superintelligence Labs acquires Manus AI for over $2B, at $100M ARR, 9 months after launch
**Manus** achieved a rapid growth trajectory in 2025, raising **$500M** from Benchmark and reaching **$100M ARR** before being acquired by **Meta** for an estimated **$4B**. The **vLLM** team launched a dedicated community site with new resources, while performance issues with…

Smol AI News · news-outlet · 4mo ago · 18
not much happened today
**GLM-4.7** and **MiniMax M2.1** open-weight model releases highlight day-0 ecosystem support, coding throughput, and agent workflows, with GLM-4.7 achieving a +9.5% improvement over GLM-4.6 and MiniMax M2.1 positioned as an OSS Claude-like MoE model with 230B total parameters…

Eugene Yan · research · 38mo ago · 5
How to Write Data Labeling/Annotation Guidelines
Writing good instructions to achieve high precision and throughput.
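As referenced from the Open Responses entry above: the point of the spec is that the same client code can target OpenAI, vLLM, Ollama, OpenRouter, and other providers. Below is a hedged sketch using the OpenAI Python SDK's Responses interface pointed at a self-hosted endpoint; the base URL, model id, and the assumption that a given local server actually serves the spec are illustrative, not confirmed by the article.

    # Hedged sketch of a Responses-style request against a self-hosted endpoint.
    # base_url, model id, and the assumption that this local server implements
    # the Open Responses spec are placeholders, not details from the article.
    from openai import OpenAI

    client = OpenAI(
        base_url="http://localhost:8000/v1",   # hypothetical local server (e.g. vLLM or Ollama)
        api_key="not-needed-locally",
    )

    response = client.responses.create(
        model="my-local-model",                # placeholder model id
        input="List two reasons an open Responses spec helps agent tooling.",
    )
    print(response.output_text)

If a provider only exposes Chat Completions, the same client would need to fall back to that endpoint instead; interoperability at the Responses level is exactly what the spec is trying to standardize.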