r/MachineLearning · June 25, 2026 · 3 min read

[R] All Routes Lead to Collapse: attention sinks, representation collapse, and norm stratification are what content-based routing does under a norm-blind metric

Mirrored from r/MachineLearning for archival readability. Support the source by reading on the original site.

I've been working on a project that started with what I thought was a transformer problem.

People usually talk about attention sinks, representation collapse, low-rank activations, weird key norm distributions, etc. as separate attention pathologies that need separate fixes.

I don't think they're actually about attention.

I think they're what any content-based router does when it's making decisions with a similarity metric that's blind to magnitude.

The observation that kicked this off is surprisingly simple.

You can rewrite softmax attention as a Boltzmann distribution over Euclidean distances only if all the keys have the same norm.

Expanding the distance,

||q-k||² = ||q||² - 2<q,k> + ||k||²

The query norm term is the same for every key, so it cancels inside the softmax. The key norm term cancels too, but only when every key has roughly the same magnitude.

Standard attention just throws the key norm term away regardless.

That means it's routing using a metric that can't "see" key magnitude.
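For anyone who wants to check the identity numerically, here is a minimal NumPy sketch on toy random vectors. It is not taken from the paper's code: the function names are made up for illustration, and the usual 1/sqrt(d) scaling is left out for brevity. It compares dot-product weights against Boltzmann weights over squared distance, first with equal-norm keys and then with mixed-norm keys.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def dot_product_weights(q, K):
    """Standard attention weights for one query (1/sqrt(d) scaling omitted)."""
    return softmax(K @ q)

def distance_weights(q, K):
    """Boltzmann weights over squared Euclidean distance: softmax(-||q - k||^2 / 2)."""
    d2 = ((K - q) ** 2).sum(axis=1)
    return softmax(-0.5 * d2)

rng = np.random.default_rng(0)
q = rng.normal(size=8)
K = rng.normal(size=(5, 8))

# Case 1: every key has norm 1, so the ||k||^2 term is a constant and cancels.
K_equal = K / np.linalg.norm(K, axis=1, keepdims=True)
equal_norms_match = np.allclose(
    dot_product_weights(q, K_equal), distance_weights(q, K_equal)
)

# Case 2: same directions, different magnitudes. The ||k||^2 term now matters.
scales = np.array([0.5, 1.0, 1.5, 2.0, 4.0])
K_mixed = K_equal * scales[:, None]
mixed_norms_match = np.allclose(
    dot_product_weights(q, K_mixed), distance_weights(q, K_mixed)
)

# Adding the dropped term back as a per-key bias recovers the distance form.
key_bias = -0.5 * (K_mixed ** 2).sum(axis=1)
restored_match = np.allclose(
    softmax(K_mixed @ q + key_bias), distance_weights(q, K_mixed)
)
```

In this toy setup `equal_norms_match` and `restored_match` come out true while `mixed_norms_match` does not, which is the gap between the two metrics described above.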

Once I started looking for that assumption in real models, it was violated basically everywhere.

My hypothesis became:

If your routing metric is blind to magnitude, the model has to compensate somehow.

And that compensation consistently shows up as:

routing concentrating onto a few positions,
representations collapsing into a low-rank subspace,
key norms becoming highly stratified.

Those aren't three unrelated phenomena; they're different symptoms of the same geometry.
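To make the three symptoms concrete, here is one hedged way to put a number on each. These are generic diagnostics chosen for illustration (share of routing mass on the top position, entropy-based effective rank, coefficient of variation of key norms). The function names are hypothetical, and the paper's actual measurements may be defined differently.

```python
import numpy as np

def routing_concentration(attn):
    """Share of total routing mass received by the most-attended position.

    attn: (num_queries, num_keys) array whose rows are attention distributions.
    """
    received = attn.sum(axis=0)
    return float(received.max() / received.sum())

def effective_rank(X):
    """Exponential of the entropy of the normalized singular values of X."""
    s = np.linalg.svd(X, compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]                      # avoid log(0)
    return float(np.exp(-(p * np.log(p)).sum()))

def key_norm_cv(K):
    """Coefficient of variation (std / mean) of per-key L2 norms."""
    norms = np.linalg.norm(K, axis=1)
    return float(norms.std() / norms.mean())

# Toy inputs: uniform routing vs. a pure sink on position 0.
uniform_attn = np.full((4, 4), 0.25)
sink_attn = np.zeros((4, 4))
sink_attn[:, 0] = 1.0

# Toy representations: full-rank identity vs. a rank-1 outer product.
full_rank_X = np.eye(4)
rank_one_X = np.outer(np.arange(1.0, 5.0), np.arange(1.0, 4.0))
```

On these toy inputs, `routing_concentration` moves from 0.25 for uniform routing to 1.0 for the sink, and `effective_rank` drops from 4 for the identity to about 1 for the outer product.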

The cool part is that it isn't just transformers.

I looked at five different routing mechanisms.

Transformers: 9 pretrained models (GPT-2 Small → XL, Pythia 160M → 2.8B). Every single one develops the same signature.

GATs: Compared graph attention against depth/width-matched GCNs on three heterophilic WebKB graphs. The attention models collapse more than the fixed-aggregation controls.

Mamba: No explicit attention here, but you can reconstruct the hidden routing operator. The effective "key" ends up being Δ·B. If I freeze Δ while keeping everything else fixed, the concentration almost completely disappears. So the selective routing is what's creating the effect.
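As a rough illustration of what "reconstructing the hidden routing operator" can look like, here is a toy single-channel selective SSM with diagonal A, unrolled into an explicit lower-triangular matrix. This is a simplified sketch under my own assumptions (one channel, no gating or convolution, discretization `exp(Δ·A)` for the state and `Δ·B` for the input), not the paper's code or the Mamba library. All names are hypothetical.

```python
import numpy as np

def selective_scan(x, delta, A, B, C):
    """Run the recurrence h_t = exp(delta_t * A) * h_{t-1} + delta_t * B_t * x_t, y_t = C_t . h_t."""
    h = np.zeros_like(A)
    y = np.zeros(len(x))
    for t in range(len(x)):
        h = np.exp(delta[t] * A) * h + delta[t] * B[t] * x[t]
        y[t] = C[t] @ h
    return y

def routing_matrix(delta, A, B, C):
    """Unroll the same recurrence into M with y = M @ x.

    M[t, s] = C_t . (exp(A * sum_{r=s+1..t} delta_r) * delta_s * B_s),
    so C_t plays the role of a query and delta_s * B_s the role of a key.
    """
    T = len(delta)
    cum = np.cumsum(delta)            # cum[t] = delta_0 + ... + delta_t
    M = np.zeros((T, T))
    for t in range(T):
        for s in range(t + 1):
            decay = np.exp(A * (cum[t] - cum[s]))
            M[t, s] = C[t] @ (decay * delta[s] * B[s])
    return M

rng = np.random.default_rng(0)
T, N = 6, 4                           # sequence length, state size
x = rng.normal(size=T)
delta = rng.uniform(0.05, 0.5, size=T)  # input-dependent step sizes (random here)
A = -np.exp(rng.normal(size=N))       # negative diagonal entries for stability
B = rng.normal(size=(T, N))
C = rng.normal(size=(T, N))

M = routing_matrix(delta, A, B, C)
operator_matches = np.allclose(M @ x, selective_scan(x, delta, A, B, C))
```

In this toy version, freezing Δ would correspond to passing a constant `delta` to `routing_matrix` and comparing the resulting rows of `M`; whether that matches the exact ablation in the paper is an assumption on my part.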

RWKV: This one surprised me. If I sweep the learned time decay, the depth where concentration starts shifts dramatically. Strong decay delays it; weak decay makes it happen much earlier. So the decay acts like a positional brake sitting on top of the same content attractor.

AttnRes (Qwen3 variant): Probably my favorite result. It routes over depth instead of tokens, and its keys are RMS-normalized, so key norm variation is literally zero by construction.

It still develops strong routing hubs.

That was the moment where I stopped thinking norm stratification was the cause.

It's just one way a router can compensate.

Across all of these architectures, what changes isn't whether collapse happens, but when and how strongly.

Those seem to be controlled mostly by whatever positional bias or decay mechanism the architecture already has (RoPE, time decay, recency bias, etc.).

The paper is about 20 pages including the appendix. It has the measurement details, null baselines, causal ablations, retraining controls, and some converging evidence from recent work (AttnRes, QKV sharing, memory caching, etc.).

I'd love feedback, especially from people who've worked on attention, state-space models, or graph transformers.

Code: https://github.com/parzi-val/all-routes-lead-to-collapse

Paper: https://arxiv.org/abs/2606.22325

submitted by /u/entropy_-