So, background: Fred Zhangzhi Peng, Shuibai Zhang, and Alex Tong worked on converting AR models to diffusion (it's already working on older models).
https://oval-shell-31c.notion.site/Open-dLLM-Open-Diffusion-Large-Language-Model-25e03bf6136480b7a4ebe3d53be9f68a
I forked the codebase and ran it through opencode with free DeepSeek-Flash / GLM5.1 overnight to upgrade it to support Qwen3.6, since the codebase is more than 6 months old. I also got the AI to mash LDLM, a very recent paper, into the mix: https://arxiv.org/pdf/2605.07933v1 (Viacheslav Meshchaninov, Alexander Shabalin, Egor Chimbulatov, Nikita Gushchin, Ilya Koziev, Alexander Korotin, Dmitry Vetrov). These guys spent 3 years getting this paper working.
https://x.com/Viacheslav91112/status/2054613430082957443?s=20
I asked it to build a config for the Qwen3.6 model, upgrade it with LDLM, and spitball some numbers on outputs with "honest" assumptions. The big one is sequence length: throughput will likely fall off at longer outputs.
Inference Throughput (Qwen3.6 LDLM, untrained, RTX 5090 32GB)

| Model | Dim | Trainable Params | Diffusion Steps | Throughput |
|---|---|---|---|---|
| Qwen3.6-35B-A3B | 2048 | 1.39B | 10 | 3,238 tok/s |
| Qwen3.6-35B-A3B | 2048 | 1.39B | 4 | ~6,500 tok/s |
| Qwen3.6-27B | 5120 | 6.75B | 10 | 745 tok/s |
| Qwen3.6-27B | 5120 | 6.75B | 4 | ~1,500 tok/s |
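The 4-step rows are extrapolations, not measurements. A minimal sketch of how that kind of extrapolation can work, assuming generation time is a fixed per-sequence overhead plus a per-step denoising cost (the constants here are back-solved from the two 35B-A3B rows, not profiled):

```python
# Model generation time as time(s) = a + b*s, where s is the number of
# diffusion steps, a is fixed overhead (e.g. decoder), and b is the
# cost per denoising step. Fitting two (steps, tok/s) points recovers
# a and b; these values are illustrative, derived from the table above.
SEQ_LEN = 64  # benchmark sequence length in tokens

def fit_linear(steps1, tokps1, steps2, tokps2):
    """Solve time(s) = a + b*s from two (steps, throughput) points."""
    t1, t2 = SEQ_LEN / tokps1, SEQ_LEN / tokps2  # seconds per sequence
    b = (t1 - t2) / (steps1 - steps2)
    a = t1 - b * steps1
    return a, b

def throughput(a, b, steps):
    """Predicted tok/s at a given number of diffusion steps."""
    return SEQ_LEN / (a + b * steps)

# Qwen3.6-35B-A3B: 10 steps -> 3,238 tok/s, 4 steps -> ~6,500 tok/s
a, b = fit_linear(10, 3238, 4, 6500)
print(f"fixed overhead ~{a * 1e3:.1f} ms, per-step ~{b * 1e3:.2f} ms")
print(f"predicted 1-step throughput: {throughput(a, b, 1):,.0f} tok/s")
```

Under this model the overhead term means throughput does not scale perfectly inversely with step count, which is consistent with ~6,500 rather than the naive 3,238 × 10/4 ≈ 8,100.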
Assumptions & Caveats
- Untrained weights: These benchmarks use randomly initialized Perceiver/decoder/diffusion-head weights. A trained model will have identical throughput but produce coherent output. Quality benchmarks (perplexity, HumanEval) will be published after training completes.
- No encoder in the loop: The frozen Qwen3.6 encoder is not used during generation — it's only needed for training (to produce latent targets). At inference, the diffusion head denoises random noise, then the Perceiver decoder maps latents to tokens. The encoder is deleted before benchmarking (`del autoencoder.token_encoder`).
- Seq len = 64: The benchmark uses a short sequence length (64 tokens). Longer sequences will reduce throughput proportionally. The 4-step throughput numbers are linear extrapolations from the 10-step measurements.
- Batch size = 1: Single-sequence generation only. Throughput scales near-linearly with batch size for the 35B-A3B (dim=2048 fits easily in VRAM), less so for the 27B (dim=5120).
- CPU RAM requirement: While the encoder is not used at inference, it must fit in system RAM during training (~54GB for 27B, ~22GB for 35B-A3B in bf16). The Qwen3.6 architecture uses Triton kernels (flash-linear-attention) that cannot run on CPU, so the encoder forward pass during training requires GPU offloading — a multi-GPU setup is recommended for training.
- Qwen3.6 requires `trust_remote_code=True`: The model uses custom architecture code (`Qwen3_5ForConditionalGeneration`) that is not in standard transformers releases. Ensure your transformers version supports it (>=4.54).
- 35B-A3B is MoE: Only 3B of its 35B parameters are active per token, giving it a much smaller hidden dim (2048) than the 27B dense model (5120). This is why the LDLM trainable components are 5x smaller and 4x faster.
- Not an apples-to-apples comparison with AR models: The diffusion model generates all tokens in parallel across N diffusion steps, while AR generates one token at a time. The "tok/s" metric favors diffusion for short sequences but does not reflect output quality, which depends on training convergence.
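To make the "no encoder in the loop" caveat concrete, here's a minimal sketch of the memory trick. The class and attribute names mirror the `del autoencoder.token_encoder` call mentioned above but are stand-ins, not the actual Open-dLLM API:

```python
import gc

class LatentAutoencoder:
    """Illustrative stand-in for the autoencoder wrapper, not the real class."""
    def __init__(self):
        self.token_encoder = object()      # frozen Qwen3.6 encoder: training only (latent targets)
        self.diffusion_head = object()     # denoises random noise at inference
        self.perceiver_decoder = object()  # maps denoised latents back to tokens

autoencoder = LatentAutoencoder()

# Drop the encoder before benchmarking: inference starts from random
# noise, so only the diffusion head and Perceiver decoder are needed.
del autoencoder.token_encoder
gc.collect()  # with real weights you'd also call torch.cuda.empty_cache()

assert not hasattr(autoencoder, "token_encoder")
```

With real weights this is what frees the encoder's VRAM/RAM footprint before the throughput numbers above are measured.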
Code is here, with GitHub issues enabled:
https://github.com/scrya-com/Open-dLLM
wandb training metrics:
https://wandb.ai/snoozie/Qwen3.6-35B-A3B-LDLM?nw=nwusersnoozie
If anyone has spare vast.ai credits / azure credits / google credits hook me up