Be wary of Qwen/Claude distillations - they're often worse than the base model
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
Just to be clear: I am not attempting to call anybody out or be mean to those who take the time/money to make these models. I just want to inform people about these distills/finetunes, since there's clearly some confusion going on.
I'm going to assume those of us who often visit this subreddit have noticed these models, particularly the "Qwopus" model and the like, though I'm sure there are probably Gemma 4/Claude distills too. As I type this, there's currently a Qwen 3.6 based Claude Fable 5 distillation model on the frontpage. Seems pretty cool, right?
Yep. Up until you actually look into how these models were distilled. This new Fable distillation uses around 4,000 samples of Fable 5/Opus 4.8 to finetune Qwen 3.6 on. 4K samples is basically nothing when it comes to improving a model's quality/performance. At best, it'll act slightly differently. But it certainly won't perform better than just running standard Qwen 3.6. If anything, it's actually likely to slightly degrade quality.
Why? 4K samples is just not enough. And I am aware that Qwopus (or it may be another finetune called Qwen3.6-Claude-Opus.4.6-Distill iirc) has a version trained on ~8-10K samples rather than 3-4K. Unfortunately, that's still nowhere near enough to be actually meaningful.
If anybody remembers the original DeepSeek-R1 Llama/Qwen distillations that DeepSeek officially released back when the model first came out, around 700,000 samples from R1 were used to create those distills. That's enough to not only impact behaviour, but actually improve benchmark scores.
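To put those numbers side by side, here is a quick back-of-the-envelope sketch. The sample counts are the approximate figures quoted in this post, not exact dataset sizes, and the labels and the `fraction_of_reference` helper are made up for illustration.

```python
# Approximate sample counts as quoted in this post (assumed, not exact figures).
sample_counts = {
    "Fable distill (~4K)": 4_000,
    "Larger Qwen/Opus distill (~10K)": 10_000,
    "Official DeepSeek-R1 distills (~700K)": 700_000,
}

def fraction_of_reference(counts, reference_key):
    """Return each dataset size as a fraction of the reference dataset size."""
    reference = counts[reference_key]
    return {name: n / reference for name, n in counts.items()}

fractions = fraction_of_reference(
    sample_counts, "Official DeepSeek-R1 distills (~700K)"
)
for name, frac in fractions.items():
    print(f"{name}: {frac:.2%} of the R1 distill dataset")
```

On these figures, the 4K dataset is well under 1% of what the R1 distills used, and the ~10K one is still under 2%. Sample count is only a rough proxy for how much a finetune can change a model, but the gap in scale is hard to ignore.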
So, these Qwen + Claude models will have a slightly different reasoning style. They might feel "more Opus-like" chatting-wise. But they are not performing better than their base Qwen models, even though, based on everything I've seen, a lot of people seem to think they are. Even with that Qwen/Opus distill that uses like 10K+ samples, that's still just not enough to transfer any sort of actual capability. There's a decent example of someone testing this, showing Qwopus hallucinating compared to the standard Qwen 3.6, and also taking twice the amount of time. There are also ofc plenty of people on this sub who have posted similar results.
So yeah, just something to be aware of whenever you come across these distills/finetunes. At the very least, don't blindly trust them to be superior, and bench them on your own specific use cases. I've personally tried a couple of these finetunes, and both of them had issues with coherence and subtle mistakes that the standard model didn't have. But YMMV.
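If you want to bench a distill against its base model on your own prompts, a minimal sketch could look like the following. Everything here is an assumption for illustration: `bench`, `base_model`, and `distilled_model` are hypothetical names, the two "models" are canned stand-ins you would replace with real calls to whatever backend you run, and substring matching is a deliberately crude scoring rule.

```python
import time
from typing import Callable, Dict, List, Tuple

def bench(generate: Callable[[str], str],
          cases: List[Tuple[str, str]]) -> Dict[str, float]:
    """Run each prompt once; report the share of answers that contain
    the expected string, plus the mean time per prompt."""
    hits = 0
    elapsed = 0.0
    for prompt, expected in cases:
        start = time.perf_counter()
        answer = generate(prompt)
        elapsed += time.perf_counter() - start
        hits += expected.lower() in answer.lower()
    return {"accuracy": hits / len(cases), "mean_seconds": elapsed / len(cases)}

# Hypothetical stand-ins: swap these for real calls to your local models.
def base_model(prompt: str) -> str:
    return {"2+2?": "4", "Capital of France?": "Paris"}.get(prompt, "")

def distilled_model(prompt: str) -> str:
    return {"2+2?": "4", "Capital of France?": "Lyon"}.get(prompt, "")

# Use prompts from your own workload, with answers you can check.
cases = [("2+2?", "4"), ("Capital of France?", "Paris")]

results = {
    name: bench(fn, cases)
    for name, fn in [("base", base_model), ("distill", distilled_model)]
}
for name, r in results.items():
    print(f"{name}: accuracy={r['accuracy']:.0%}, "
          f"mean latency={r['mean_seconds']:.4f}s")
```

The point is just to run the same prompts through both models and compare correctness and time, rather than going by how the output feels. For anything serious you would want far more cases and a better scoring method than substring matching.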