Tag

Inference

40 articles archived under #inference · RSS

r/LocalLLaMA community 5h ago

qwen3.6 just stops

Sometimes qwen 3.6 just stops in the middle of a task. Is there a way to avoid it? This is qwen-code CLI, but it also happens on opencode. Running with vLLM with…

17
arXiv — Machine Learning research 15h ago

Rotation-Preserving Supervised Fine-Tuning

arXiv:2605.10973v1 Announce Type: new Abstract: Supervised fine-tuning (SFT) improves in-domain performance but can degrade out-of-domain (OOD) generalization. Prior work suggests that this degradation is related to changes in dominant singular subspaces of pretrained weight…

22
arXiv — Machine Learning research 15h ago

Behavioral Mode Discovery for Fine-tuning Multimodal Generative Policies

arXiv:2605.11387v1 Announce Type: new Abstract: We address the problem of fine-tuning pre-trained generative policies with reinforcement learning (RL) while preserving the multimodality of their action distributions. Existing methods for RL fine-tuning of generative policies…

17
arXiv — NLP / Computation & Language research 15h ago

ReAD: Reinforcement-Guided Capability Distillation for Large Language Models

arXiv:2605.11290v1 Announce Type: new Abstract: Capability distillation applies knowledge distillation to selected model capabilities, aiming to compress a large language model (LLM) into a smaller one while preserving the abilities needed for a downstream task. However, most…

27
arXiv — NLP / Computation & Language research 15h ago

SOMA: Efficient Multi-turn LLM Serving via Small Language Model

arXiv:2605.11317v1 Announce Type: new Abstract: Large Language Models (LLMs) are increasingly deployed in multi-turn dialogue settings where preserving conversational context across turns is essential. A standard serving practice concatenates the full dialogue history at every…

33
arXiv — NLP / Computation & Language research 15h ago

PRISM: Pareto-Efficient Retrieval over Intent-Aware Structured Memory for Long-Horizon Agents

arXiv:2605.12260v1 Announce Type: new Abstract: Long-horizon language agents accumulate conversation history far faster than any fixed context window can hold, making memory management critical to both answer accuracy and serving cost. Existing approaches either expand the…

8
arXiv — NLP / Computation & Language research 15h ago

ORBIT: Preserving Foundational Language Capabilities in GenRetrieval via Origin-Regulated Merging

arXiv:2605.12419v1 Announce Type: new Abstract: Despite the rapid advancements in large language model (LLM) development, fine-tuning them for specific tasks often results in the catastrophic forgetting of their general, language-based reasoning abilities. This work investigates…

24
arXiv — NLP / Computation & Language research 15h ago

fg-expo: Frontier-guided exploration-prioritized policy optimization via adaptive KL and Gaussian curriculum

arXiv:2605.11403v1 Announce Type: cross Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has become the standard paradigm for LLM mathematical reasoning, with Group Relative Policy Optimization (GRPO) serving as the dominant algorithm. We identify two overlooked…

38
r/LocalLLaMA community 21h ago

Is using vLLM actually worth it if you aren't serving the model to other people?

So, as most of us here are, I'm a llama.cpp loyalist. Easy to understand, great configuration, relatively stable, etc. But I’ve been increasingly tempted by vLLM, especially since AMD just added it as a built-in inference engine to Lemonade, and I happen to have an AMD GPU. The…

4
NVIDIA Developer Blog official-blog 1d ago

How to Eliminate Pipeline Friction in AI Model Serving

The path from a trained AI model to production should be smooth, but rarely is. Many teams invest weeks fine-tuning models, only to discover that exporting to a...

17
r/LocalLLaMA community 1d ago

Needle: We Distilled Gemini Tool Calling Into a 26M Model

We open-sourced Needle, a 26M parameter function-calling (tool use) model. It runs at 6000 tok/s prefill and 1200 tok/s decode on consumer devices. We were always frustrated by the little effort made towards building agentic models that run on budget phones, so we conducted…

4
r/LocalLLaMA community 1d ago

New Qwen3.6 27b Autoround Quant (int4) Best Recipe

I've been using the int4 Autoround quant from "Lorbus/Qwen3.6-27B-int4-AutoRound" and it has been pretty good! Great quality and performance on an RTX 5090 vllm. I decided to use a similar Autoround recipe but use the "autorund-best" preset instead, it uses more iterations to…

34
r/LocalLLaMA community 1d ago

Gemma 4 MTP vs DFlash on 1x H100: dense vs MoE results

Benchmarked Gemma 4 MTP and z-lab's DFlash on a single H100 80GB using vLLM and NVIDIA's SPEED-Bench qualitative dataset. Setup: Hardware: 1x H100 80GB Runtime: vLLM Dataset: SPEED-Bench qualitative Prompts: 880 total, 80 prompts across each of 11 categories Models:…

17
vLLM releases dev-tools 3d ago

v0.20.2

vLLM v0.20.2 Highlights: This release features 6 commits from 6 contributors (0 new)! This is a small patch release with bug fixes for DeepSeek V4, gpt-oss, and Qwen3-VL. Bug fixes: DeepSeek V4 sparse attention: Re-enable the persistent topk path on Hopper and ensure the memset…

11
vLLM releases dev-tools 9d ago

v0.20.1

vLLM v0.20.1: This is a patch release on top of v0.20.0 primarily focused on DeepSeek V4 stabilization and performance improvements, along with several important bug fixes. DeepSeek V4: Base model support (#41006). Multi-stream pre-attention GEMM (#41061), configurable…

37
NVIDIA Developer Blog official-blog 13d ago

Speed Up Unreal Engine NNE Inference with NVIDIA TensorRT for RTX Runtime

Neural network techniques are increasingly used in computer graphics to boost image quality, improve performance, and streamline content creation. Approaches...

17
MIT News — AI research 14d ago

Enabling privacy-preserving AI training on everyday devices

A new method could bring more accurate and efficient AI models to high-stakes applications like health care and finance, even in under-resourced settings.

13
Smol AI News news-outlet 15d ago

not much happened today

**vLLM v0.20.0** introduces significant improvements in memory and MoE serving efficiency, including **TurboQuant 2-bit KV cache** for **4× KV capacity** and a **2.1% latency improvement**. The update supports multiple hardware platforms like **DeepSeek V4 MegaMoE on…

9
vLLM releases dev-tools 15d ago

v0.20.0

vLLM v0.20.0 Highlights: This release features 752 commits from 320 contributors (123 new)! DeepSeek V4: Initial DeepSeek V4 support landed (#40860), with DSML token-leakage fix in DSV4/3.2 (#40806), DSA + MTP IMA fix (#40772), and a silu clamp limit on the shared expert (…

33
NVIDIA Developer Blog official-blog 22d ago

Run High-Throughput Reinforcement Learning Training with End-to-End FP8 Precision

As LLMs transition from simple text generation to complex reasoning, reinforcement learning (RL) plays a central role. Algorithms like Group Relative Policy...

31
NVIDIA Developer Blog official-blog 1mo ago

Accelerating Vision AI Pipelines with Batch Mode VC-6 and NVIDIA Nsight

In vision AI systems, model throughput continues to improve. The surrounding pipeline stages must keep pace, including decode, preprocessing, and GPU...

17
NVIDIA Developer Blog official-blog 1mo ago

NVIDIA Platform Delivers Lowest Token Cost Enabled by Extreme Co-Design

Co-designed hardware, software, and models are key to delivering the highest AI factory throughput and lowest token cost. Measuring this goes far beyond peak...

14
MIT News — AI research 1mo ago

AI system learns to keep warehouse robot traffic running smoothly

This new approach adapts to decide which robots should get the right of way at every moment, avoiding congestion and increasing throughput.

29
NVIDIA Developer Blog official-blog 1mo ago

Maximize AI Infrastructure Throughput by Consolidating Underutilized GPU Workloads

In production Kubernetes environments, the difference between model requirements and GPU size creates inefficiencies. Lightweight automatic speech recognition...

38
NVIDIA Developer Blog official-blog 1mo ago

Deploying Disaggregated LLM Inference Workloads on Kubernetes

As large language model (LLM) inference workloads grow in complexity, a single monolithic serving process starts to hit its limits. Prefill and decode stages...

15
Hugging Face official-blog 1mo ago

Holotron-12B - High Throughput Computer Use Agent

Team article from Hcompany, published March 17, 2026. Authors: Pierre-Louis Cedoz (plcedoz38), Hamza Benchekroun (hamza-hcompany), Aurélien Lac (h-aurelien-lac), delfosse (aureliendelfosseathai), Tony Wu…

6
Smol AI News news-outlet 1mo ago

not much happened today

**Moonshot's Attention Residuals** paper introduced an input-dependent attention mechanism over prior layers with a **1.25x compute advantage** and less than **2% inference latency overhead**, validated on **Kimi Linear 48B total / 3B active**. The paper sparked debate on…

26
Smol AI News news-outlet 2mo ago

not much happened today

**NVIDIA’s Nemotron 3 Super** is a **120B parameter / ~12B active** open model featuring a **hybrid Mamba-Transformer / SSM Latent MoE** architecture and **1M context window**, delivering up to **2.2x faster inference than GPT-OSS-120B** in FP4 with strong throughput gains. It…

10
NVIDIA Developer Blog official-blog 2mo ago

Removing the Guesswork from Disaggregated Serving

Deploying and optimizing large language models (LLMs) for high-performance, cost-effective serving can be an overwhelming engineering problem. The ideal...

37
MIT News — AI research 2mo ago

New method could increase LLM training efficiency

By leveraging idle computing time, researchers can double the speed of model training while preserving accuracy.

13
NVIDIA Developer Blog official-blog 2mo ago

Using NVFP4 Low-Precision Model Training for Higher Throughput Without Losing Accuracy

As the sizes of AI models and datasets continue to increase, relying only on higher-precision BF16 training is no longer sufficient. Key challenges such as...

25
NVIDIA Developer Blog official-blog 2mo ago

Unlock Massive Token Throughput with GPU Fractioning in NVIDIA Run:ai

As AI workloads scale, achieving high throughput, efficient resource usage, and predictable latency becomes essential. NVIDIA Run:ai addresses these challenges...

30
Smol AI News news-outlet 3mo ago

Z.ai GLM-5: New SOTA Open Weights LLM

**Zhipu AI** launched **GLM-5**, an **Opus-class** model scaling from **355B to 744B parameters** with **DeepSeek Sparse Attention** integration for cost-efficient long-context serving. GLM-5 achieves **SOTA on BrowseComp** and leads on **Vending Bench 2**, focusing on office…

18
NVIDIA Developer Blog official-blog 3mo ago

Automating Inference Optimizations with NVIDIA TensorRT LLM AutoDeploy

NVIDIA TensorRT LLM enables developers to build high-performance inference engines for large language models (LLMs), but deploying a new architecture...

31
Smol AI News news-outlet 3mo ago

ElevenLabs $500m Series D at $11B, Cerebras $1B Series H at $23B, Vibe Coding -> Agentic Engineering

**Google's Gemini 3** is being integrated widely, including a new **Chrome side panel** and **Nano Banana** UX features, with rapid adoption and a **78% unit-cost reduction** in serving costs. The **Gemini app** reached **750M+ MAU** in Q4 2025, nearing ChatGPT's user base.…

23
Smol AI News news-outlet 3mo ago

Context Graphs: Hype or actually Trillion-dollar opportunity?

**Zhipu AI** launched **GLM-OCR**, a lightweight **0.9B** multimodal OCR model excelling in complex document understanding with top benchmark scores and day-0 deployment support from **lmsys**, **vllm**, and **novita labs**. **Ollama** enabled local-first usage with easy offline…

28
Smol AI News news-outlet 3mo ago

Open Responses: explicit spec for OpenAI's Responses API supported by OpenRouter, Ollama, Huggingface, vLLM, et al

**OpenAI** launched the **Open Responses** API spec, an open-source, multi-provider standard for interoperable LLM APIs designed to simplify agent stacks and tooling. Early adopters like **ollama** and **vLLM** support the spec, while notable absences include **anthropic** and…

4
Smol AI News news-outlet 4mo ago

Meta Superintelligence Labs acquires Manus AI for over $2B, at $100M ARR, 9 months after launch

**Manus** achieved a rapid growth trajectory in 2025, raising **$500M** from Benchmark and reaching **$100M ARR** before being acquired by **Meta** for an estimated **$4B**. The **vLLM** team launched a dedicated community site with new resources, while performance issues with…

30
Smol AI News news-outlet 4mo ago

not much happened today

**GLM-4.7** and **MiniMax M2.1** open-weight model releases highlight day-0 ecosystem support, coding throughput, and agent workflows, with GLM-4.7 achieving a +9.5% improvement over GLM-4.6 and MiniMax M2.1 positioned as an OSS Claude-like MoE model with 230B total parameters…

18
Eugene Yan research 38mo ago

How to Write Data Labeling/Annotation Guidelines

Writing good instructions to achieve high precision and throughput.

5
