Nemotron-Labs-Diffusion from NVIDIA
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
Model Overview

Nemotron-Labs-Diffusion is a tri-mode language model. It supports both autoregressive (AR) decoding and diffusion-based parallel decoding, and switches between them at inference time simply by changing the attention pattern of the same model. Combining the two modes enables a third, self-speculation: the same model drafts tokens in parallel with diffusion and verifies them with AR decoding over a shared KV cache, achieving high acceptance lengths and decoding efficiency. Because switching modes only requires changing the attention pattern, a single model can run efficiently at different concurrency levels across deployment scenarios.
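To make the mode switching and self-speculation described above more concrete, here is a toy sketch in plain Python. It is not NVIDIA's implementation: the helper names (`causal_mask`, `block_draft_mask`, `accept_draft`, `toy_next`) are hypothetical, the mask layout is an assumed reading of "switching the attention pattern", and the acceptance rule is the generic greedy speculative-decoding scheme rather than anything confirmed by the model card.

```python
from typing import Callable, List, Sequence

Mask = List[List[bool]]  # mask[i][j] is True when position i may attend to position j


def causal_mask(n: int) -> Mask:
    """AR pattern: each position sees itself and earlier positions only."""
    return [[j <= i for j in range(n)] for i in range(n)]


def block_draft_mask(prefix_len: int, block_len: int) -> Mask:
    """Assumed drafting pattern: the prefix stays causal, while the draft
    block sees the whole prefix and attends bidirectionally within itself."""
    n = prefix_len + block_len
    mask = causal_mask(n)
    for i in range(prefix_len, n):
        for j in range(prefix_len, n):
            mask[i][j] = True
    return mask


def accept_draft(
    draft: Sequence[int],
    verify_next: Callable[[List[int]], int],
    prefix: Sequence[int],
) -> List[int]:
    """Greedy verification: keep draft tokens while they match the AR choice.
    On the first mismatch, emit the AR token instead and stop. If every draft
    token matches, emit one extra AR token. A real system would verify the
    whole draft in one forward pass; this loop is sequential for clarity."""
    ctx = list(prefix)
    out: List[int] = []
    for tok in draft:
        target = verify_next(ctx)
        if tok != target:
            out.append(target)
            return out
        out.append(tok)
        ctx.append(tok)
    out.append(verify_next(ctx))
    return out


def toy_next(ctx: List[int]) -> int:
    """Stand-in for the AR verifier: predicts the previous token plus one, mod 10."""
    return (ctx[-1] + 1) % 10


if __name__ == "__main__":
    # The first two draft tokens match the verifier; the third is replaced.
    print(accept_draft([4, 5, 9, 7], toy_next, [1, 2, 3]))
```

In this sketch the number of tokens returned per call plays the role of the acceptance length: the more draft tokens the verifier agrees with, the more tokens are committed per verification step.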
Models on Hugging Face:

https://huggingface.co/nvidia/Nemotron-Labs-Diffusion-VLM-8B
https://huggingface.co/nvidia/Nemotron-Labs-Diffusion-14B-Base
https://huggingface.co/nvidia/Nemotron-Labs-Diffusion-14B
https://huggingface.co/nvidia/Nemotron-Labs-Diffusion-8B-Base
https://huggingface.co/nvidia/Nemotron-Labs-Diffusion-8B
https://huggingface.co/nvidia/Nemotron-Labs-Diffusion-3B-Base