Qwen 3.6 27B AR -> Diffusion: local training on a 5090
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
Based on the work of open-dllm, which achieved an autoregressive -> diffusion realignment head for Qwen 2.5: the same exact model under the hood, delivering a 4x improvement.

TLDR: I haven't got a trained model yet, just a burnt-out GPU cable and a new PSU on order. I did actually get the thing to do a forward pass on a 5090, with the help of a second GPU (an RTX 4000) to offload recreations.

Below are some low-level ramblings / findings / observations.

Firstly, the amount of VRAM normally required to do this is > 600 GB (I think). After some wrangling, and giving up on the Optane route, it turns out it's possible to train in a QLoRA form factor, which takes the model and trains it in NVIDIA NVFP4. I'm attempting to get the entire 27B model to train on a 5090.

Code: https://github.com/scrya-com/dLLM-castlehill
Latest training run: https://wandb.ai/snoozie/open-dllm-27b/runs/arcefpjp?nw=nwusersnoozie

Public service announcement: to avoid burning cables, throttle down the NVIDIA max power for consumer 5090 cards from 600 W to 400 W.

The vanilla route with open-dllm is validated on Qwen 2.5 with a 4x speedup (if someone with lots of compute could take a look, it might just work). I deviated from it a bit to explore improving this, and found a few papers. One is d3LLM, "Ultra-Fast Diffusion LLM" (https://github.com/hao-ai-lab/d3LLM), which boasts faster diffusion speeds, so I pulled their code into the codebase and included their MDM loss. Seems OK. It's basically also taking the order of the tokens into account.

Diffusion can take many steps (see graph), but we can shorten that to get much higher throughput (tokens per second). If we could theoretically do it in 1 step, you might see some crazy speeds.
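To make the steps-vs-throughput point concrete, here is a toy sketch of iterative unmasking. It is not code from open-dllm or d3LLM: `toy_denoiser` and `decode` are hypothetical names, and the "model" cheats by reading the target and attaching a random confidence. The only thing it shows is the bookkeeping: fewer steps means more tokens committed per forward pass.

```python
import math
import random

MASK = "<mask>"

def toy_denoiser(tokens, target, rng):
    """Hypothetical stand-in for one diffusion LM forward pass.

    Returns {position: (predicted_token, confidence)} for every masked
    position. A real model would produce logits; this toy reads the
    target and attaches a random confidence.
    """
    return {i: (target[i], rng.random())
            for i, tok in enumerate(tokens) if tok == MASK}

def decode(target, num_steps, seed=0):
    """Unmask a fully masked sequence in at most num_steps forward passes."""
    rng = random.Random(seed)
    tokens = [MASK] * len(target)
    per_step = math.ceil(len(target) / num_steps)  # tokens committed per pass
    passes = 0
    while MASK in tokens:
        preds = toy_denoiser(tokens, target, rng)
        passes += 1
        # Commit the most confident predictions, leave the rest masked.
        best = sorted(preds, key=lambda i: preds[i][1], reverse=True)[:per_step]
        for i in best:
            tokens[i] = preds[i][0]
    return tokens, passes

if __name__ == "__main__":
    target = "the quick brown fox jumps over the lazy dog".split()
    for steps in (9, 3, 1):
        out, passes = decode(target, steps)
        print(f"steps={steps}: {passes} forward passes, "
              f"{len(out) / passes:.1f} tokens per pass")
```

The toy always predicts correctly, so it hides the real catch: with an actual model, committing many tokens in one pass typically costs quality, which is what the extra training is trying to fix.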
https://wandb.ai/snoozie/open-dllm-compare?nw=nwusersnoozie

When I was working on improving ltx2 to speed up video recreation and do one-shot diffusion, I attempted to implement a trick shot based on a paper, Variational Flow Maps ("make some noise"). See here:

https://github.com/johndpope/ltx2-castlehill
https://wandb.ai/snoozie/vfm-v4a?nw=nwusersnoozie

This was built to do 1-step image generation by basically crafting noise that almost looks like the image. In a similar way, this can be done with text to help reduce the number of denoising steps.

VFM:
https://github.com/scrya-com/dLLM-castlehill/issues/2
https://github.com/pengzhangzhi/Open-dLLM/issues/31

UPDATE:
1) For open-dllm, you have to calculate the anchors from the teacher model (64 layers, from some response), or
2) for d3LLM, we calculate the trajectories and use them for training.

There are helper scripts to do both, and an agent (Claude, Grok, or any other) can help with them.

I'm enjoying opencode.ai. You can get a long way for very little expense; I'm on the $5/month plan: https://opencode.ai/go?ref=7C4F1XYS01
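A rough way to picture the "crafted noise" idea for text. This is a toy analogy, not VFM and not anything from the repos above. Assume a hypothetical noise model hands the denoiser a start state that already resembles the answer, so fewer positions are left to unmask. `crafted_start` and `passes_needed` are made-up names, and the start state is faked by copying from the target.

```python
import math
import random

MASK = "<mask>"

def crafted_start(target, keep_fraction, rng):
    """Hypothetical 'crafted noise': a start state with keep_fraction of
    the positions already filled in.

    A learned noise model would have to produce this; the toy cheats and
    copies the kept positions straight from the target.
    """
    n_keep = int(len(target) * keep_fraction)
    keep = set(rng.sample(range(len(target)), n_keep))
    return [tok if i in keep else MASK for i, tok in enumerate(target)]

def passes_needed(start, tokens_per_pass):
    """Forward passes needed if each pass commits tokens_per_pass tokens."""
    masked = sum(tok == MASK for tok in start)
    return math.ceil(masked / tokens_per_pass)

if __name__ == "__main__":
    rng = random.Random(0)
    target = [f"tok{i}" for i in range(64)]
    for frac in (0.0, 0.5, 0.75):
        start = crafted_start(target, frac, rng)
        print(f"keep={frac:.2f}: {passes_needed(start, 8)} passes "
              f"at 8 tokens per pass")
```

The better the starting point, the fewer passes are left. Whether a text model can actually learn a starting point that good is the open question.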