
vLLM v0.20.0

Highlights

This release features 752 commits from 320 contributors (123 new)!

  • DeepSeek V4: Initial DeepSeek V4 support landed (#40860), with DSML token-leakage fix in DSV4/3.2 (#40806), DSA + MTP IMA fix (#40772), and a silu clamp limit on the shared expert (#40950).
  • CUDA 13.0 default: The default CUDA wheel on PyPI and the vllm/vllm-openai:v0.20.0 image switched to CUDA 13.0; architecture lists and build-args were cleaned up (#39878), and CUDA was bumped to 13.0.2 to match PyTorch 2.11.0 (#40669). As a general rule of thumb, our CUDA version policy follows PyTorch's. We highly recommend installing vLLM with uv and using --torch-backend=cu129 if you are on CUDA 12.9 (see the snippet after this list).
  • PyTorch 2.11 upgrade (#34644): vLLM ships on torch 2.11 for CUDA, and XPU is now also on torch 2.11 (#37947) — XPU is no longer pinned to 2.10. This is a breaking change to environment dependencies.
  • Python 3.14: Added to the supported Python version list (#34770).
  • Transformers v5: vLLM now runs on HuggingFace transformers>=5 (#30566), with vision-encoder torch.compile bypass (#30518) and continued v4/v5 compat fixes including PaddleOCR-VL image processor max_pixels (#38629), Mistral YaRN warning (#37292), and Jina ColBERT rotary inv_freq recompute (#39176).
  • New large models: Hunyuan v3 (Hy3) preview (#40681) with HYV3 reasoning parser (#40713); Granite 4.1 Vision as a built-in multimodal model (#40282).
  • FlashAttention 4 as default MLA prefill: FA4 re-enabled as the default MLA prefill backend (#38819) with head-dim 512 and paged-KV support on SM90+ (#38835), plus an upstream FA4 sync (#38690).
  • TurboQuant 2-bit KV cache: New attention backend delivering 2-bit KV cache compression with 4× capacity (#38479), now with FA3/FA4 prefill support (#40092).
  • Online quantization frontend: New end-to-end online quantization frontend (#38138), with docs (#39736); experts_int8 consolidated into the FP8 online path (#38463); MXFP8 online quant moved to the new frontend (#40152).
  • vLLM IR: Initial IR skeleton with rms_norm op (#33825), OOT-platform kernel imports (#38807), gemma_rms_norm reworked on IR (#39014), and IR op testing/benchmarking infra added (#40167) — foundation for future kernel work.
  • Model Runner V2 advances: Eagle prefill full-CUDA-graph (#37588), auto-resolve cudagraph mode/sizes from attention backend (#32936), fused probabilistic rejection sample kernels (#38496), config validation for unsupported features (#38758), piecewise-fallback disabled for eagle draft decodes (#39773), multiple prompt-logprobs support (#39937), prefill warmup coverage (#40746), and a fix for accuracy regression caused by stale sampled/draft tokens (#39833).
  • MoE refactor series: Unquantized migrated to Full Oracle Flow (#36286), CT W8A8 to Oracle (#39187), SharedExperts class (#35153), SharedFusedMoE removed (#35782), DefaultMoERunner split (#35326) and later combined back into MoERunnerBase (#40560), shared/fused expert output sum moved into MoERunnerBase (#35949), ZeroExpertFusedMoE in new framework (#35549), compressed_tensors_moe.py split (#38960), GPTQMarlinMoEMethod reworked with MK (#37990), XPU & CUTLASS MoE relocated to fused_moe/experts/ (#40568, #40574), make_expert_params_mapping renamed (#40671), MoE LoRA refactor (#40338), and MoE DP chunking removed (#39107).
  • Performance: Batch-invariant mode optimized with a fused RMS norm — 2.1% E2E latency improvement (#40413); avoid seq_lens_cpu GPU→CPU sync (#40654); cache InductorPass.hash_source (#39328); skip FX-graph deserialization on loading for faster warm compile (#40151); CUDAGraph memory profiling enabled by default for clearer startup memory accounting (#38284).
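
To pick the right --torch-backend value for uv, a quick local check of which CUDA runtime your PyTorch build targets is enough. A minimal sketch, assuming only that PyTorch is already installed:

```python
# Check which CUDA runtime the installed torch build targets, then pick the
# matching uv backend flag, e.g. "12.9" -> `uv pip install vllm --torch-backend=cu129`.
import torch

print(torch.__version__)   # vLLM v0.20.0 expects torch 2.11 on CUDA
print(torch.version.cuda)  # None on CPU-only builds
```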

Model Support

  • New architectures: DeepSeek V4 (#40860), Hunyuan v3 preview (#40681), Granite 4.1 Vision (#40282), EXAONE-4.5 (#39388), BharatGen Param2MoE (#38000), Phi-4-reasoning-vision-15B (#38306), Cheers multimodal (#38788), telechat3 (#38510), FireRedLID (#39290), jina-reranker-v3 (#38800), Jina Embeddings v5 (#39575), Nemotron-v3 VL Nano/Super (#39747).
  • Gemma4 series: fast prefill (#38879), quantized MoE (#39045), Eagle3 (#39450), block-local attention + YaRN for Gemma3 (#39823), bidirectional vision attention for sliding layers (#40534), token-repetition fix via dynamic BOS (#39842), multimodal embedder norm-order fix (#40411), plus a string of streaming/tool-call fixes (#38844, #38909, #38992, #39114, #39679, #39027).
  • Quantization formats: GGUF support for MiniMax-M2.1 (#36965), non-standard GGUF quant types with prefixes such as UD-IQ1_S (#39471).
  • Speculative decoding: Eagle3 for MiniMax-M2 (#37512), Eagle3 for Gemma4 (#39450).
  • LoRA (a minimal offline usage sketch follows this list): Qwen3ASRForConditionalGeneration (#37247), Gemma4ForConditionalGeneration (#39291, #38844), DeepSeek V3.2 (#35077), Qwen3.5 / Step3.x expert base_layer extension (#37114), MoE LoRA refactor (#40338), dual-CUDA-streams linear layer (#35721).
  • Multimodal MRoPE refresh: mm_features-based MRoPE for Ernie-4.5 VL (#39753), Keye-VL / Keye-1.5-VL (#39869), PaddleOCR-VL (#39888).
  • Other: Nano-Nemotron-VL static image inputs fix (#40724); Qwen3 MoE no longer calls gate twice (#40664); DeepSeek V2-Lite accuracy drop fix (#40673); Parakeet UX / perf enhancements (#39423); ColModernVBERT updated for latest HF checkpoint (#39307); NemotronH default mamba_ssm_cache_dtype=float32 with NemotronHNanoVLV2 auto-hook (#39032); new TP plan styles for the Transformers backend (#40467); GLM-5.1 fix on ROCm (#40763).
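
To illustrate the LoRA additions above, a minimal offline sketch using vLLM's existing LoRARequest API; the base model id and adapter path are placeholders, not artifacts shipped with this release:

```python
# Minimal offline LoRA sketch; the model id and adapter path are placeholders.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="google/gemma-3-4b-it", enable_lora=True)  # placeholder base model
outputs = llm.generate(
    "Summarize the v0.20.0 release in one sentence.",
    SamplingParams(max_tokens=64),
    lora_request=LoRARequest("my_adapter", 1, "/path/to/lora_adapter"),
)
print(outputs[0].outputs[0].text)
```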

Engine Core

  • Model Runner V2: Full CUDA graph for eagle prefill (#37588), auto cudagraph mode/sizes based on attention backend (#32936), fused probabilistic rejection-sample kernels (#38496), config validation (#38758), eagle-draft piecewise fallback disabled (#39773), multiple prompt logprobs (#39937), prefill warmup coverage (#40746), stale sampled/draft tokens accuracy fix (#39833).
  • vLLM IR: IR skeleton + rms_norm (#33825), OOT kernel import hooks (#38807), gemma_rms_norm on IR (#39014), IR op testing/benchmarking infra (#40167).
  • torch.compile: Opaque Objects on torch 2.11 (#39286), AOT compile with batch-invariance mode (#39201), Inductor cache nested under AOT dir (#39718), split FX graph via codegen (#38657), Inductor pre-grad passes re-enabled for torch≥2.12 (#38944), strings in custom ops without compile regressions (#38123), MLA + group FP8 fusion (#38877), SiluMul activation+quant fusion refactor (#39684), donate_graph_module=True for standalone_compile (#39733), skip FX graph deserialization on loading (#40151), include Inductor & functorch configs in compile-cache key (#40627), respect TORCH_COMPILE_DISABLE at vLLM config level (#40715; see the eager-fallback sketch after this list), disable Sequence Parallelism for piecewise compilation (#38373).
  • Attention: FA4 as default MLA prefill (#38819), head-dim 512 + paged-KV on sm90+FA4 (#38835), FA4 upstream sync (#38690), full CUDA graph for FlexAttention (#36298), FlexAttention non-causal support (#40394), unified 2D/3D triton_unified_attention (#40631), TRTLLM minimax_allreduce_rms ported (#37045), concat_mla_q half-types only (#37892), batch-invariance-aware backend auto-selection (#40193), avoid seq_lens_cpu GPU→CPU sync (#40654).
  • Helion kernels: torch.compile support for Helion kernels (#38592).
  • HMA / KV offload: GPU-side KV events for HMA (#37688), group block hashes/IDs tracked (#37109), unified memory layout for offloading workers (#37206), shutdown() on OffloadingConnector (#39182), request context passed through KV offload (#39185), sliding-window lookup (#36645), multi-group worker transfer (#38453), multi-KV-group lookup/load/store (#39401, #39402, #39403).
  • Features: NUMA binding for GPU workers (#38635), opt-in VLLM_MEDIA_CACHE media URL caching (#37123), safe request abort when FSM fails to advance (#38663), KV connector prioritized over internal registry (#38301), CUDAGraph memory profiling on by default (#38284), shared-expert overlap restored (#39222), CONFIG_REGISTRY config-class lookup fix when on-disk model_type differs (#39554), workspace-resize GPU memory leak fix (#39226), SWA/chunked-local runtime admission capped to startup pool-sizing bound (#40946).
  • Pluggable layers: Applied to llm_head / vocab embedding (#33465) and MoE layers (#33556).
  • Mamba: Stochastic rounding (#35753), different Conv state layouts (#37416), FlashInfer selective_state_update (#36162).
  • Metrics & scheduling: Labeled waiting-breakdown (capacity/deferred) metric (#38435), API server handshake simplified (#39364), mm-scheduler get_num_embed overhead reduced (#40143), request_id on FinishedRequestStats (#39710).
  • Executor: RayExecutorV2 introduced (#36836); unified engine process monitoring with Ray backend (#35862).
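
Per #40715, vLLM now honors PyTorch's standard TORCH_COMPILE_DISABLE switch at the config level. A minimal sketch of using it to force eager execution when bisecting a compile-related regression; the model id is a placeholder:

```python
# TORCH_COMPILE_DISABLE is PyTorch's standard escape hatch; per #40715 vLLM
# now respects it at the config level. Set it before importing vllm so the
# engine falls back to eager execution instead of compiling.
import os
os.environ["TORCH_COMPILE_DISABLE"] = "1"

from vllm import LLM

llm = LLM(model="facebook/opt-125m")  # placeholder model; runs without torch.compile
```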

Hardware & Performance

  • NVIDIA: swapAB support for SM120 CUTLASS blockwise FP8 GEMM (#38325), MXFP4 W4A4 CUTLASS MoE for SM100 (#37463), TRTLLM GEN NVFP4 MoE with non-512-aligned hidden dims via weight padding (#39510), TRTLLM FP8 MoE with shuffled weights + BlockMajorK layout (#38993), fused qknorm+rope kernel on SM90 (#37376), tuned fused_moe config for RTX PRO 6000 Blackwell (#39183), ViT full CUDA graph for Qwen3-VL video (#38061), --enable-vit-cuda-graph for VLM examples (#40580), default max_frames_per_batch auto-infer for ViT CG video (#40445), fused FP8 output quantization into merge_attn_states (#36518), batched KV-cache swap via cuMemcpyBatchAsync (#38460), sm_110 (Jetson Thor) added to CUDA 13.0 build targets (#39233).
  • AMD ROCm: ZenCPU / AMD Zen CPU backend via zentorch (#39967), RDNA 3.5/4 device IDs (gfx1150/1151/1201) (#38455), gfx1102/gfx1103 added (#40037), MORI EP for unquantized MoE with AITER (#37529), MoRI build with AMD AINIC stack (#38371), MoRI-IO message format aligned with P2pNcclConnector and vllm-router (#39565), MORI prefill/decode API correction (#39835), AITER gemm w8a8 ptpc integration (#33773), TritonW4A16LinearKernel (#37352), asymmetric INT8 in TritonInt8ScaledMMLinearKernel (#38501), fused_silu_mul_block_quant enabled (#38817), KV-cache shuffle for paged_attention_common (#32914), MLA decode output zero-fill removed in AITER (#37539), MLA dual RMS norm fusion pass for DeepSeek/Kimi-K2 (#39242, with older-AITER guard #40386), AITER MLA + Eagle3 spec decode (#39616), DFlash on ROCm (#39703), wvSplitK FP8 path for RDNA (#37712), GPU↔NUMA-node detection (#40015), non-causal attention in ROCM_ATTN (#40176), engine-shutdown GPU memory leak fix (#38503), score-correction-bias dtype cast for DeepSeek/Kimi-K2 (#39999).
  • Intel XPU: torch 2.11 upgrade for XPU (#37947) — no longer pinned to 2.10; initial GDN attention for Qwen3-Next / Qwen3.5 (#33657), torch.compile for XPU GDN attention (#39466), XPU MXFP8 quant op (#38682), XPU MXFP4 quant op (#39857), per-channel FP8 linear (#38316), FP8 KV cache on XPU (#37731), round_int8 for Intel Triton (#38825), MoE Triton in online FP8 quantization fix (#40109), current_platform.supports_fp8() updated for TritonExperts (#40132), NIXL import on XPU fix (#40430), fusion-pattern support disabled on XPU (#39789).
  • CPU: CPU draft-model speculative decoding (#32662), CPU int8 compute mode in AWQ (#35697), head_size 512 in cpu_attn (#38676), gelu in cpu_fused_moe (#38770), OMP replacement (#36487), BF16 GELU LUT on ARM (#37469), W4A16 Autoround on CPU (#38192), CPU affinity/memory mgmt refactor (#39781), IBM Z s390x torch 2.11 builds (#39910), faster exp routine for lower-precision dtypes (#38112), inter-node pipeline parallel fix (#40150), RISC-V multiple RVV VLEN targets (#39478), RISC-V platform detection fix (#40427), exp() input clamp to prevent NaN on CPU/RISC-V (#40428).
  • TPU: tpu-inference upgraded to 0.18.0 (#40395).
  • DeepSeek / MLA / Indexer: Persistent TopK scheduler for DSV3.2 DSA decode (#37421), DSV3.2 indexer fused weights projection (#38684), Triton MLA perf fixes (#33529), indexer WK upcast to BF16 for fusion (#38928), MLA indexer uniform-decode optimization for MTP>1 (#39458), DSA + MTP IMA fix (#40772).
  • GDN / Mamba: Kernel fusion in GDN (#37813), TMA aligned with upstream FLA (#38981), GPU↔CPU syncs eliminated in prefill and spec-decode paths (#38361, #38047).
  • Other: DeepGEMM integrated into the vLLM wheel via CMake (#37980), Lustre FS checkpoint prefetching enabled by default (#39422), Gemma4 fused routing Triton kernel (#39083), Gemma4 embed_input_ids GPU/CPU sync removed (#39234), Nemotron VL image/video preprocessing optimized (#40283), SiLU block-quant fusion v1 (#32996), bilinear_pos_embed Triton kernel for ViT (#37948), mean-pooling optimization (~5.9% throughput) (#38559), redundant-sync removal for pooling (~3.7% throughput) (#39113), H2D pageable-memory copy reduction (#38794), fused zero initializer for FP8 DeepGemm block-quant (#39547), batch-invariant fused-rms-norm 2.1% E2E latency improvement (#40413), InductorPass.hash_source cached (#39328), humming quantization kernel (#34556).

Large Scale Serving

  • EPLB: Alternative communication for EPLB weight exchange (#33176), nixl-based EPLB communicator (#36276), mapping optimization with router record for prefill (#36261), TransferMetadata consolidation (#37341), Async EPLB synchronization refactor (#37601), asyncio infrastructure removed from Async EPLB (#40730), replica-selection bias fix in fused_moe router (#40810), Async EPLB integration test added (#40168).
  • WideEP: Naive all2all replaced by allgather + reducescatter (#33728).
  • KV Offload / Connector (a minimal connector-config sketch follows this list): 3FS KVConnector (#37636), unified memory layout for offloading workers (#37206), cache_salt propagated through MP connector for per-user isolation (#39837), multi-connector metrics of same type (#40010), LMCache block-allocation event (#38856), LMCache MP save optimization with MLA (#38810), num_lmcache_extra_cached_token in KVTransferParams (#39843), offload all KV blocks during prefill in P/D (#40346), DP control bundle pinned to first GPU's node on Ray (#39167), FlashInfer NVLink MNNVL workspace sized to EP group (#40893).
  • Disaggregated / NIXL / Mamba: Full PD support for Mamba2-like models on Heterogeneous TP deployments (#37635), Nixl bumped to 0.10.1 (#39922), TpKVTopology + HeteroTPTransferConfig unified into TransferTopology (#39529), NIXL EP treated as batched experts in fused_moe (#40412).
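
As a point of reference for the connector work above, a minimal sketch of attaching a KV connector through vLLM's existing kv_transfer_config plumbing. The connector and role strings follow current vLLM conventions rather than anything specific to this release, and the model id is a placeholder:

```python
# Minimal KV-connector sketch using the existing kv_transfer_config plumbing;
# connector/role values follow current vLLM conventions, model id is a placeholder.
from vllm import LLM
from vllm.config import KVTransferConfig

llm = LLM(
    model="facebook/opt-125m",
    kv_transfer_config=KVTransferConfig(
        kv_connector="NixlConnector",  # e.g. the NIXL-based connector
        kv_role="kv_both",             # act as both KV producer and consumer
    ),
)
```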

Quantization

  • New formats & methods: TurboQuant 2-bit KV cache compression (#38479) with FA3/FA4 prefill (#40092), per-token-head INT8/FP8 KV cache quantization (#38378), fused FP8/NVFP4 output quantization in MLA attention (#35792), NVFP4 dense models on MI300/MI355X and Hopper via emulation (#35733), NVFP4 MoE emulation fallback for H100/MI300/MI350 (#35737), humming quantization kernel (#34556).
  • Kernels: MXFP8 in Marlin GEMM/MoE with Mxfp8LinearOp refactor (#34664), MXFP4 W4A4 CUTLASS MoE for SM100 (#37463), NVFP4 in reshape_and_cache_flash (#37332), batch-invariant NVFP4 linear (#39322), FlashInfer CuteDSL batched-experts backend for NVFP4 MoE (#38251), special GptOssMxfp4MoeMethod (#39604), W4A8_FP8 MoE TP>1 correctness fix (#40310), NVFP4 CUTLASS MoE OOB-read fix for non-multiple-of-4/16 expert counts (#40351), RMS norm + quant fusion fix on DeepGEMM UE8M0 path for B200 (#40552), Gemma4 quantized MoE (#39045).
  • Compressed tensors: W8A8 MXFP8 linear/MoE (CompressedTensorsW8A8Mxfp8) (#38815), CT W8A8 in Oracle structure (#39187), layerwise reloading of attention/KV quantized models (#38995), experts_int8 consolidated with FP8 online quant (#38463), MXFP8 online quant on the new frontend (#40152).
  • Online quant: Quantized model init failure fix with prefetch offloading (#40432), current_platform.supports_fp8() updated for TritonExperts on XPU/ROCm (#40132); a minimal online-FP8 usage sketch follows this list.
  • XPU / CPU / AMD: XPU MXFP4 (#39857), XPU MXFP8 GEMM + compressed-tensor schema (#38707), XPU FP8 per-channel linear (#38316), FP8 KV cache on XPU (#37731), CPU W4A16 Autoround (#38192), XPU W4A16 Autoround (#37986), asymmetric INT8 TritonInt8ScaledMMLinearKernel on ROCm (#38501), Quark W8A8 INT8 MoE inference (#36320).
  • Deprecations: Petit NVFP4 removed (#32694).
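
A minimal sketch of the online-quantization path from the user's side. This uses the long-standing quantization="fp8" entry point rather than any API new in this release, and the model id is a placeholder:

```python
# On-the-fly FP8 quantization of an unquantized checkpoint; this is the
# long-standing `quantization="fp8"` entry point, not an API new to v0.20.0.
from vllm import LLM

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", quantization="fp8")  # placeholder
print(llm.generate("Hello, world!")[0].outputs[0].text)
```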

API & Frontend

  • OpenAI / Anthropic API: presence_penalty / frequency_penalty on Responses API (#38613; exercised in the client sketch after this list), Responses API streaming migrated to unified parser (#38755), tool_choice / tools validation on Responses to match OpenAI (#40399), Mistral Grammar factory (#38150), multimodal support on /inference/v1/generate (#38405), max_tokens_per_doc in rerank (#38827), Generative Scoring (#34539), MaxSim re-enabled on GPU (#38620), chat_template_kwargs on Anthropic /v1/messages (#40125), auto-detection of reasoning_config when only reasoning_parser is set (#38214), reasoning parsers can access model config via adjust_request (#37848, #39027), effective chat-template kwargs passed to reasoning parsers (#40460), reasoning parsers expose reasoning_start_str/reasoning_end_str (#40566).
  • Pooling ecosystem: Pooling entrypoints overhauled across scoring (#28631), pooling (#39153), and cleanup (#39675); preprocessing/postprocessing offloaded to thread pool (#39763); async scheduling disabled by default for pooling (#39592); logit_scale added to PoolerConfig (#39435), then renamed logit_bias/logit_scale → logit_mean/logit_sigma for affine score calibration (#39530) — breaking. LLM.reward deprecated; use LLM.encode instead (#40688).
  • gRPC / streaming: Streaming on token-generation endpoint (#37171); gRPC periodic stats logging + servicer log forwarding (#38333); standard grpc.health.v1 health check for Kubernetes-native probes (#38016).
  • Tool / reasoning parsers: Treat <tool_call> as implicit reasoning end in Qwen3 (#35687), is_reasoning_end_streaming() override for GptOssReasoningParser (#35745), Mistral tool parser HF-tokenizer fix (#39294), Mistral pre-v11 tool parser trailing-output fix (#40531), Gemma4 streaming HTML duplication / JSON corruption / null-as-string fixes (#38909, #38992, #39114, #39679), HF tokenizer concurrent-borrow fix in tool parsers (#40059), HYV3ReasoningParser no longer mutates chat_template_kwargs (#40713).
  • Multimodal: Externally processed mm_kwargs with cache injection (#39502), PyAV video backend for concurrent decoding (#39986), custom video metadata for pre-extracted frame sequences (#40133), image+video mixed inputs (per prompt) for VLM examples (#40335), deepstack buffer optimized for Qwen3 multimodal (#40145), readonly multimodal processor warmup during renderer startup (#40797), mm_processor_kwargs forwarded in offline generate APIs (#40251), normalize malformed dict prompts that carry token IDs in prompt (#40339), hotwords for FunASR (#39674), bundle get_generation_prompt() params into SpeechToTextParams (#36268).
  • Frontend / vLLM Omni: --omni delegates to vLLM Omni (#40744); avoid eager import of mistral_common (#40043).
  • LLM / CLI: Structured-output special tokens preserved in offline LLM.chat (#39352), use_audio_in_video passable at vllm serve for nemotron-nano-vl (#38538), deferred imports save ~2s CLI startup (#40056), improved MM-input-too-long error message (#39409), warning when FP8 KV cache misses prefill query quant (#39752), clearer DCP error message (#28443), --model deprecation warning updated (#39518), Mimo reasoning/tooling parsers mapped (#40089), human-readable k/K/m/M… suffix in JSON CLI args (#40473).
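
A hedged client-side sketch of the new Responses-API penalty knobs against a local vllm serve instance. The parameters are passed through extra_body because the upstream OpenAI client may not expose them as first-class Responses arguments, and the model id is a placeholder:

```python
# Exercise presence_penalty / frequency_penalty on the Responses API (#38613)
# against a local `vllm serve` endpoint; passed via extra_body since the
# upstream client may not accept them as first-class Responses parameters.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.responses.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model id
    input="Write a limerick about paged attention.",
    extra_body={"presence_penalty": 1.2, "frequency_penalty": 0.5},
)
print(resp.output_text)
```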

Spec Decode

  • Eagle3 for MiniMax-M2 (#37512), Eagle3 for Gemma4 (#39450), AITER MLA + Eagle3 on ROCm (#39616); an enablement sketch follows this list.
  • TurboQuant FA3/FA4 for prefill paths (#40092).
  • Mamba: default to 'align' cache mode for Mamba-based models when speculative decoding is enabled (#40454).
  • Unified Synthetic Acceptance Rate for V1 and V2 (#40662); SpecDecodeBaseProposer moved out of eagle.py (#40732); DSA + MTP IMA fix (#40772).
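
A hedged sketch of enabling Eagle3 speculative decoding through the existing speculative_config dict; both checkpoints below are placeholders, and the draft model must match the target architecture:

```python
# Hedged Eagle3 enablement sketch via the existing speculative_config dict;
# both checkpoints are placeholders and the draft must match the target.
from vllm import LLM

llm = LLM(
    model="MiniMaxAI/MiniMax-M2",           # placeholder target model
    speculative_config={
        "method": "eagle3",
        "model": "/path/to/eagle3-draft",   # placeholder draft checkpoint
        "num_speculative_tokens": 3,
    },
)
```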

Security

  • SSRF fix in batch runner download_bytes_from_url (#38482).

Dependencies

  • PyTorch 2.11 for CUDA (#34644) and XPU (#37947) — XPU no longer pinned to 2.10.
  • CUDA 13.0 default with updated architecture lists and cleaned build-args (#39878); CUDA bumped to 13.0.2 to match PyTorch 2.11.0 (#40669); sm_110 (Jetson Thor) added (#39233).
  • Python 3.14 added to supported versions (#34770).
  • Transformers v5 (#30566), with vision-encoder torch.compile bypass (#30518) and continued v4/v5 compat fixes.
  • FlashAttention 4 upstream sync (#38690) and symlink-on-install behavior (#38814).
  • FlashInfer bumped to 0.6.8 (#39959).
  • AITER triton BUFFER_OPS fix + version updates (#38580), AITER reverted to v0.1.10.post3 (#39509); Nixl bumped to 0.10.1 (#39922) and pinned per CUDA major in CI (#39851); DeepGEMM integrated into the wheel via CMake (#37980); fastsafetensors added to NVIDIA Dockerfile (#38950); Helion bumped 0.3.2 → 0.3.3 (#38062).
  • Removed / moved: resampy dependency dropped (#39524), librosa direct dependency dropped (#39079), pyav and soundfile moved to common requirements (#39997).

Breaking Changes

  1. PyTorch 2.11 + CUDA 13.0.2 defaults — an environment dependency change; XPU also moves to torch 2.11.
  2. Transformers v5 is the supported baseline (#30566).
  3. Metrics rework: vllm:prompt_tokens_recomputed removed (#38709); num_cached_tokens / num_external_computed_tokens replaced with PrefillStats (#37460).
  4. Pooler config rename: logit_bias/logit_scale → logit_mean/logit_sigma (#39530); see the migration sketch after this list.
  5. Async scheduling default OFF for pooling models (#39592).
  6. CUDAGraph memory profiling now ON by default (#38284) — startup memory accounting changes.
  7. Petit NVFP4 quantization removed (#32694); LLM.reward deprecated, use LLM.encode (#40688); cprofile / cprofile_context deprecated (#39100); V0 accept output buffer deprecated (#39125).
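
For breaking change 4, a minimal migration sketch, assuming PoolerConfig keeps its keyword constructor; only the field names changed:

```python
# Migration for the #39530 rename: the affine score-calibration fields moved
# from bias/scale to mean/sigma spellings on PoolerConfig.
from vllm.config import PoolerConfig

# Before (pre-v0.20.0):
# cfg = PoolerConfig(logit_bias=0.0, logit_scale=1.0)

# After (v0.20.0):
cfg = PoolerConfig(logit_mean=0.0, logit_sigma=1.0)
```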

V0 Deprecation

  • Petit NVFP4 (#32694), accept output buffer in attention (#39125), cprofile / cprofile_context (#39100), LLM.reward offline API (#40688).
