Why not dynamic active parameters (and other questions for the knowledgeable)
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
Why do we have to choose between MoE and dense models? Wouldn't it be possible to have a model where the user can select the number of active parameters? If the user chooses them all, it is dense.
So, depending on the task, a user could decide how many active parameters it needs, or even write scripts to find the best trade-off for that specific task.
Or it could happen automatically: depending on the difficulty of the task, the model could decide how many active parameters it needs.
If I need the most intelligence possible, I could trade away speed. But if I need speed, I could trade away intelligence, all without having to load several models into RAM at once (which I usually can't).
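To make the question concrete, here is a minimal sketch of top-k expert routing in plain Python, where `k` (the number of active experts) is an ordinary runtime argument. Everything in it is an assumption for illustration: the toy experts, the `router_logits` values, and the function names `route_top_k` and `moe_forward` are hypothetical and not taken from any real model or library.

```python
import math

def route_top_k(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    if not 1 <= k <= len(router_logits):
        raise ValueError("k must be between 1 and the number of experts")
    # Pick the k experts with the largest router scores.
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    # Softmax over the selected experts only, so their weights sum to 1.
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_forward(x, experts, router_logits, k):
    """Weighted sum of the outputs of the k selected experts."""
    return sum(w * experts[i](x) for i, w in route_top_k(router_logits, k))

# Toy experts (hypothetical): each one just scales its input.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
router_logits = [0.1, 2.0, -1.0, 0.5]  # made-up router scores for one token

sparse = moe_forward(1.0, experts, router_logits, k=2)             # 2 of 4 experts run
all_on = moe_forward(1.0, experts, router_logits, k=len(experts))  # every expert runs
active_sparse = [i for i, _ in route_top_k(router_logits, 2)]
```

Mechanically, nothing stops `k` from changing per request, and `k=len(experts)` is the "use them all" end of the dial. What the sketch does not show is whether output quality holds up when `k` differs from the value the model was trained with; that part remains an open question here.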
Along the same lines, if for some tasks I need speed and not intelligence, wouldn't it be possible to use the MTP (multi-token prediction) part of the model alone? Instead of using it to predict ahead for the rest of the model, couldn't the MTP part just answer directly, to save time and compute on some tasks?
My third question is why a model cannot modify its weights on the fly to really learn from its failures. Every time a model hits the same error several times and has to run tests or even do research until it finds a solution, it gains very valuable information: it has discovered something it is bad at and found out how to do it properly. Of course, you can ask the model to dump what it learned into a doc.md, or even create an extension that does that automatically (I asked pi with qwen3.6 35b to extend itself for that, and it created a tool that captures errors in tool calling).
But each time the model reads that doc.md, it consumes tokens, time, etc. That is already one extra turn among the many it has to take in an agentic task. If it tries a command flag that doesn't exist and learns the correct usage within a chat, it is a pity that it forgets this with every new session.
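For reference, the notes-file workaround described above could look roughly like this. It is a minimal sketch, not the tool that pi generated: the helper names `record_lesson` and `load_lessons`, the entry format, and the example commands are all hypothetical.

```python
import tempfile
from pathlib import Path

def record_lesson(path, failed_call, fix):
    """Append a 'failed call -> working fix' note, skipping exact duplicates."""
    path = Path(path)
    entry = f"- `{failed_call}` failed; use `{fix}` instead."
    existing = path.read_text(encoding="utf-8").splitlines() if path.exists() else []
    if entry in existing:
        return False  # already recorded, keep the file short
    with path.open("a", encoding="utf-8") as f:
        f.write(entry + "\n")
    return True

def load_lessons(path):
    """Return the notes as text to prepend to a new session's prompt."""
    path = Path(path)
    return path.read_text(encoding="utf-8") if path.exists() else ""

# Demo in a temporary directory; the commands are made up for illustration.
with tempfile.TemporaryDirectory() as tmp:
    notes = Path(tmp) / "doc.md"
    first = record_lesson(notes, "sometool --fast", "sometool --quick")
    again = record_lesson(notes, "sometool --fast", "sometool --quick")
    prompt_prefix = load_lessons(notes)
```

Deduplicating entries keeps the file from growing with repeated failures, but it does not remove the cost in question: whatever `load_lessons` returns still has to be read as tokens at the start of every session.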
I have a feeling that all my questions are stupid (maybe MoE and dense models are trained differently, the training depends on the number of active parameters, MTP can never work as a standalone model, or changing the weights on the fly would end in chaos: a model that is not stable over time for fixed workflows, or that even loses its agentic capabilities because it was trained on long chains of thought). Still, I would be happy if someone with more knowledge could explain these things, so I can get a deeper understanding.
Cheers!