r/LocalLLaMA · May 26, 2026 · 2 min read

qwen 3.6 27B AR-> Diffusion - local training on 5090

Based on the work of Open-dLLM (which converted Qwen 2.5 from autoregressive to diffusion with a realignment head: the exact same model under the hood, delivering a 4x speedup).

TLDR

I haven't got a trained model yet, just a burnt-out GPU cable and a new PSU on order. I did get the thing to do a forward pass on a 5090, with a second GPU (an RTX 4000) helping to offload recreations.

Below are some low level ramblings / findings / observations.

Firstly, the amount of VRAM normally required to do this is more than 600 GB (I think).

After some wrangling, and after giving up on the Optane route, it turns out to be possible to train QLoRA-style, with the model loaded and trained in NVIDIA's NVFP4 format.

I'm attempting to get the entire 27B model to train on a 5090.
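To put rough numbers on the memory gap, here is a back-of-envelope sketch. Everything in it is an assumption for illustration, not a measurement from this run: the 16 bytes per parameter figure is the usual rule of thumb for mixed-precision Adam fine-tuning (activations excluded), and the NVFP4 estimate assumes 4-bit values with one 8-bit scale shared per block of 16 values. LoRA adapters, optimizer state for the adapters, and activations are not counted.

```python
PARAMS = 27e9  # parameter count, taken from the "27B" in the post


def gb(n_bytes):
    """Convert bytes to decimal gigabytes."""
    return n_bytes / 1e9


def full_finetune_bytes(params):
    # Assumed rule of thumb: bf16 weights (2) + bf16 grads (2)
    # + fp32 master weights (4) + Adam first/second moments in fp32 (4 + 4).
    return params * 16


def nvfp4_weight_bytes(params, block=16):
    # Assumed layout: 4-bit values plus one 8-bit scale per `block` values.
    bits_per_param = 4 + 8 / block
    return params * bits_per_param / 8


if __name__ == "__main__":
    print(f"full fine-tune, before activations: {gb(full_finetune_bytes(PARAMS)):.0f} GB")
    print(f"NVFP4 frozen weights only:          {gb(nvfp4_weight_bytes(PARAMS)):.1f} GB")
```

Under those assumptions the weights-plus-optimizer state alone is in the hundreds of GB, while the frozen 4-bit weights land in the mid-teens of GB, which is why a QLoRA-style setup is at least plausible on a single 32 GB card.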

https://github.com/scrya-com/dLLM-castlehill

latest training run

https://wandb.ai/snoozie/open-dllm-27b/runs/arcefpjp?nw=nwusersnoozie

Public service announcement: to avoid burning cables, throttle the NVIDIA max power limit on consumer 5090 cards down from 600 W to 400 W.
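One way to apply that cap is `nvidia-smi`'s power-limit flag (`-pl`, in watts). The small wrapper below is a sketch, not something from the repo: the function names are hypothetical, the GPU index 0 is an assumption, and it defaults to a dry run that only returns the command.

```python
import shutil
import subprocess


def power_limit_cmd(watts, gpu_index=0):
    """Build the nvidia-smi call that caps one GPU's power limit (in watts)."""
    if watts <= 0:
        raise ValueError("watts must be positive")
    return ["nvidia-smi", "-i", str(gpu_index), "-pl", str(watts)]


def apply_power_limit(watts, gpu_index=0, dry_run=True):
    """Return the command; run it only if dry_run is False and nvidia-smi exists."""
    cmd = power_limit_cmd(watts, gpu_index)
    if dry_run or shutil.which("nvidia-smi") is None:
        return cmd
    subprocess.run(cmd, check=True)  # usually needs root privileges
    return cmd


if __name__ == "__main__":
    print(" ".join(apply_power_limit(400)))
```

The limit typically has to be reapplied after a reboot, so it is worth putting in whatever script launches the training run.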

The vanilla route with open-dllm is validated on Qwen 2.5 with a 4x speedup (if someone with lots of compute could take a look, it might just work). I deviated a bit to explore improving on this and found a few papers. One is d3LLM, Ultra-Fast Diffusion LLM, https://github.com/hao-ai-lab/d3LLM, which boasts faster diffusion speeds, so I pulled their code into the codebase and included their MDM loss. Seems OK. It basically also takes the order of the tokens into account.
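For anyone who hasn't met the term, a generic masked-diffusion (MDM) loss looks roughly like the sketch below. This is the plain textbook form as I understand it (mask each token with probability t, then take cross-entropy on the masked positions only, weighted by 1/t), not d3LLM's actual code, and it does not include whatever they do with token order. The function names and the toy inputs are made up for illustration.

```python
import math
import random


def mask_tokens(tokens, t, mask_id, rng):
    """Forward process: replace each token with mask_id independently with probability t."""
    return [mask_id if rng.random() < t else tok for tok in tokens]


def mdm_loss(true_token_logprobs, noisy_tokens, mask_id, t):
    """Negative log-likelihood on masked positions, weighted by 1/t, averaged over length.

    true_token_logprobs[i] is the model's log-probability of the original token at
    position i given the noisy sequence (supplied by the caller in this sketch).
    """
    masked_nll = sum(
        -lp for lp, tok in zip(true_token_logprobs, noisy_tokens) if tok == mask_id
    )
    return masked_nll / (t * len(noisy_tokens))


if __name__ == "__main__":
    rng = random.Random(0)
    clean = [5, 1, 7, 3]
    noisy = mask_tokens(clean, t=0.5, mask_id=9, rng=rng)
    fake_logprobs = [math.log(0.5)] * len(clean)  # pretend the model is 50% sure everywhere
    print(noisy, mdm_loss(fake_logprobs, noisy, mask_id=9, t=0.5))
```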

Diffusion can take many steps (see graph), but if we shorten that we should see much higher throughput in tokens per second. If we could theoretically do it in 1 step, you might see some crazy speeds.
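To make the steps-versus-throughput point concrete, here is a toy decoding loop. It is only a sketch: `toy_denoiser` is a random stand-in for a real model, and the "unmask an equal share of the most confident positions each step" schedule is one common choice I'm assuming, not necessarily what open-dllm or d3LLM do. The thing to look at is the number of forward passes for a fixed block of tokens.

```python
import math
import random

MASK = -1  # sentinel for a still-masked position


def toy_denoiser(tokens, vocab_size, rng):
    """Stand-in for a real model: propose a (token, confidence) pair for every position."""
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def diffusion_decode(length, steps, vocab_size=100, seed=0):
    """Fill `length` masked positions in at most `steps` forward passes."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    forward_passes = 0
    for step in range(steps):
        masked = [i for i, tok in enumerate(tokens) if tok == MASK]
        if not masked:
            break
        proposals = toy_denoiser(tokens, vocab_size, rng)
        forward_passes += 1
        # Unmask an equal share of what is left, most confident positions first.
        k = math.ceil(len(masked) / (steps - step))
        for i in sorted(masked, key=lambda i: proposals[i][1], reverse=True)[:k]:
            tokens[i] = proposals[i][0]
    return tokens, forward_passes


if __name__ == "__main__":
    for steps in (64, 8, 1):
        _, passes = diffusion_decode(length=64, steps=steps)
        print(f"steps={steps:>2}  forward passes={passes:>2}  tokens per pass={64 / passes:.0f}")
```

An autoregressive model needs one forward pass per token, so anything that cuts the step count while keeping quality translates fairly directly into tokens per second.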

https://wandb.ai/snoozie/open-dllm-compare?nw=nwusersnoozie

When I was working on improving LTX-2, to speed up video recreation with one-shot diffusion, I tried to implement this trick based on a paper:

Variational Flow Maps ("make some noise")
https://arxiv.org/abs/2603.07276

see here

https://github.com/johndpope/ltx2-castlehill

https://wandb.ai/snoozie/vfm-v4a?nw=nwusersnoozie

This was built to do 1 step image generation by basically crafting noise that almost looks like the image.

In a similar way, this (VFM) can be done with text to help reduce the number of denoising steps.

https://github.com/scrya-com/dLLM-castlehill/blob/255d13ae45300f6e4aee69f46ba57bbb32df2b8b/tasks/train_vfm.py#L37

https://github.com/scrya-com/dLLM-castlehill/issues/2

https://github.com/pengzhangzhi/Open-dLLM/issues/31

UPDATE
The README is bloated from the upstream (sorry, just skip to the Qwen 3.6 stuff), but here is the gist for anyone continuing this work:

1) For open-dllm:

you have to calculate the anchors from the teacher model (64 layers) from some response.

2) For d3llm:

we calculate the trajectories and use them for training.

There are helper scripts to do both, and coding agents (Claude, Grok, any of them) can help. I'm enjoying opencode.ai: you can get a long way for very little expense. I'm on the $5/mth plan https://opencode.ai/go?ref=7C4F1XYS01

submitted by /u/Revolutionary_Ask154