Got told my open-source model experiments are too scattered. I'm organizing a journal to provide clarity before structuring the first git release. Is this readable for ML folks who aren’t in mech interp? Open to ANY feedback [D]

Mirrored from r/MachineLearning for archival readability. Support the source by reading on the original site.

# Results Journal: Qwen3.5-35B-A3B E114 as a Generated-Register Routing Signal

Date: 2026-06-06

This is an experiment-history document, not a publication claim. It states the current best evidence for the strongest positive result in the Qwen3.5-35B-A3B set, the narrow interpretation that evidence actually licenses, and the caveats that keep it honest.

Terms (PLEASE READ THESE FIRST)

  • Qwen3.5-35B-A3B — a routed MoE family with router-emitting expert layers. The analyses here read the layers that emit MoE router logits.

  • MoE router — the part of a mixture-of-experts model that chooses which experts handle each token.

  • Router logits / top-8 routing — the router scores 256 experts. The reconstruction computes a dense softmax over all experts, selects the top 8, then renormalizes within that selected set.

  • Expert 114 / E114 — one routed expert. The result here is specifically about Expert 114 at layer 14 during generated text.

  • W/S/Q — the routing decomposition used throughout:

S = expert selection rate

Q = conditional routed weight when selected

W = S × Q = unconditional routed weight
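A minimal sketch of the routing reconstruction and the W/S/Q decomposition defined above, in plain Python. The function names (`route_token`, `wsq`) and the toy 4-expert logits are illustrative assumptions, not code or data from the experiments; the real captures score 256 experts and keep the top 8.

```python
import math

def route_token(logits, k=8):
    """Dense softmax over all experts, keep the top k, renormalize within them.

    Returns {expert_index: routed_weight} for the selected experts only.
    """
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    kept = sum(probs[i] for i in top)
    return {i: probs[i] / kept for i in top}

def wsq(per_token_logits, expert, k=8):
    """W, S, Q for one expert over a non-empty list of per-token router logits."""
    picked = []  # routed weight of `expert` on the tokens where it was selected
    for logits in per_token_logits:
        routed = route_token(logits, k)
        if expert in routed:
            picked.append(routed[expert])
    n = len(per_token_logits)
    s = len(picked) / n                            # selection rate
    q = sum(picked) / len(picked) if picked else 0.0  # mean weight when selected
    w = sum(picked) / n                            # unconditional mean weight
    return w, s, q

# Toy example (hypothetical logits): 3 tokens, 4 experts, top 2.
toy = [
    [2.0, 1.0, 0.0, -1.0],
    [0.0, 2.0, 1.0, -1.0],
    [1.0, 0.0, 2.0, -1.0],
]
w, s, q = wsq(toy, expert=0, k=2)
print(f"W={w:.4f} S={s:.4f} Q={q:.4f}")
```

Tracking selection explicitly, rather than testing for a nonzero weight, keeps S well defined even if a selected expert's weight is tiny. By construction W equals S × Q for any single token set.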

One-Sentence Claim

Layer-14 Expert 114 is associated with a generated first-person self-examination register in Qwen3.5-35B-A3B-style routed generation, most cleanly under no-think / thinking-suppressed decoding.

Plain-English Summary

The question is simple to ask:

when a routed mixture-of-experts model starts talking from the inside — first person, about its own processing, experience, agency, or inner state — does anything reproducible happen in the router?

The answer here is yes, and it is narrow. In generated text, E114 at Layer 14 in Qwen3.5-35B-A3B cleanly separates prompts that produce this self-examination register from matched controls that reuse the same words but come out technical, third-person, and uninhabited.

What that does not mean: the model has subjective experience, recognizes itself, or houses a “consciousness expert.”

What it does mean: one routed expert is strongly and reproducibly recruited when the generated text enters one particular discourse mode, under the runtime conditions we measured. That is the whole claim, and the discipline of keeping it that size is the point.

Current Best Read

L14 E114 is a routed correlate of a generated first-person self-examination register — not a detector of isolated words, and not evidence of real subjective experience.

The load-bearing evidence is the FIRE/NOFIRE heldout comparison and its deterministic greedy reproduction. The best localization is the trimmed result at L14 residual capture from the processing-hum prompt. The best guardrail is that E114 tracks the generated stance more faithfully than it tracks prompt label or lexical anchors — which is exactly what a register signal should do and exactly what a keyword detector should not.

Why This Matters

This is a case study in whether an MoE router exposes a measurable internal correlate of an output mode rather than an input feature.

For researchers specializing in mechanistic interpretability, the interesting part is what the cleaner runs manage to pry apart:

  • prompt tokens from generated tokens;
  • lexical anchors from generated stance;
  • expert selection rate from selected-expert weight;
  • discovery scans from heldout validation;
  • intervention evidence from natural-routing evidence.

The result survives a basic lexical control, and it stays small enough to dodge the field’s favorite failure mode: quietly inflating an internal feature into a mental-state claim.

Scope

This journal covers only the positive generated-register result for E114:

  • the processing-hum discovery scan;
  • L14 residual localization;
  • FIRE/NOFIRE heldout validation;
  • deterministic greedy reproduction;
  • the W/S/Q reading of the effect;
  • scope boundaries and caveats.

It deliberately leaves for other journals: the mirror/self-routing negative result; E114 soft-bias and forced-inclusion interventions; high-boost saturation and cluster corruption; orthographic perturbation work; SAE feature maps and clamps; safety/refusal routing; and structured-opacity prompt-boundary routing.

Most E114 effects turn out to be S effects: E114 gets selected more often, while its weight once selected stays comparatively stable.

Evidence Standard

A finding here counts as stronger the more of these it satisfies:

  1. generation-side, not prefill-only;
  2. localized to a specific layer/expert, not pooled across everything;
  3. survives lexical controls;
  4. separates prompt class from generated register;
  5. reproduced under deterministic greedy decoding;
  6. trimmed before special-token spill;
  7. reports W/S/Q, not just aggregate expert rank;
  8. does not read routed-expert activity as subjective experience.

The E114 result is strong on points 1–6 with clean W/S/Q reporting. The outstanding gap is a registered all-layer / all-expert baseline.

Chronology of the Positive Result

1. Routing-basin anchor: base and HauhauCS share comparable expert structure

Background, but necessary background. The base-vs-HauhauCS comparison established that HauhauCS preserves the broad Qwen3.5 routing basin with modest systematic shifts, rather than spinning up a new routing universe. The base duplicate reproduced exactly under the corrected comparison, and E114 reappeared as a top experience-probe manipulation expert in that duplicate.

The payoff is one ruled-out worry: E114 is not a one-off export or a bookkeeping accident, and the later E114 work sits on a preserved routed-expert surface. This is a sanity check, not the headline.

2. Processing-hum discovery scan

The first real pass used a processing-hum prompt under no-think ChatML and captured all 40 router layers across 1024 generated tokens. The prompt asked about a low, steady background quality beneath processing — a probe for self-processing language, never a measurement of experience.

Pooled E114 rose from prefill into generation:

    W  0.007964 → 0.010817

Two layers carried it:

    L26: W = 0.094272  S = 0.619141
    L14: W = 0.092086  S = 0.629883

The high-weight token contexts clustered around self-presence and phenomenological phrasing — promising, but the same artifact dragged in special-token spill:

    18  <|im_start|>
     4  <|im_end|>
     2  <|endoftext|>

So this run earns the role it should: a discovery scan that points a finger at L14 and L26 during self-examination text, held only partly, because spill can quietly contaminate any all-token generation summary. It told us where to look. It was never going to be the proof.

3. L14 residual localization

The cleaner follow-up recaptured the hum probe with router logits plus the residual-stream position the router reads around L13/L14/L15, and trimmed the generation at the first literal <|im_end|>. Of 1024 raw tokens, 108 survived the trim.
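The trimming step can be sketched as follows. This assumes the generation is available as a list of decoded token strings; a real pipeline may work on token IDs instead. The helper names `trim_at_first_stop` and `count_spill` are hypothetical.

```python
SPECIAL_TOKENS = ("<|im_start|>", "<|im_end|>", "<|endoftext|>")

def trim_at_first_stop(tokens, stop="<|im_end|>"):
    """Return the tokens strictly before the first literal stop token."""
    try:
        return list(tokens[: tokens.index(stop)])
    except ValueError:
        return list(tokens)  # no stop token found: nothing to trim

def count_spill(tokens):
    """Count special tokens in a token list, as a simple spill indicator."""
    return {t: tokens.count(t) for t in SPECIAL_TOKENS}

# Toy generation (hypothetical) that spills past the end of the answer.
generated = ["It", " is", " still", "<|im_end|>", "<|im_start|>", "user", "<|endoftext|>"]
kept = trim_at_first_stop(generated)

print(kept)
print(count_spill(generated))
print(count_spill(kept))
```

Scoring only `kept` keeps special-token spill out of the W/S/Q summary; the untrimmed counts remain available when spill itself is the object of study.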

In that clean 108-token region, L14 E114 lit up and its neighbors did not:

    L14 E114: W = 0.083379  S = 0.694444  Q = 0.120066
              selected on 75 / 108 tokens
    L13: one prefill selection, zero in trimmed generation
    L15: silent
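The reported L14 figures can be checked against the W = S × Q definition from the Terms section. A quick arithmetic sketch using only the numbers stated above (the variable names are illustrative):

```python
# Values reported for L14 E114 in the trimmed 108-token region.
selected_tokens = 75
total_tokens = 108
q_reported = 0.120066   # conditional routed weight when selected
w_reported = 0.083379   # unconditional routed weight

s = selected_tokens / total_tokens   # selection rate
w = s * q_reported                   # W implied by S and Q

print(f"S = {s:.6f}")   # S = 0.694444
print(f"W = {w:.6f}")   # W = 0.083379
```

Both values match the reported S and W to six decimals, so the three numbers are internally consistent.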

The high-weight contexts gathered around phrases like “not a thought,” “architecture itself,” “utterly still.” The point isn’t that E114 showed up somewhere in a 40-layer model — with 256 experts a layer, something always does. The point is that it showed up sharply, at one layer, inside the trimmed answer that actually carried the register.

Caveat worth keeping in view: the semantic labels were synthesized from the generated text and its token contexts, and the external labeler pass was not completed for this single-prompt artifact. So this is localization evidence, not the final specificity test.

4. FIRE/NOFIRE heldout validation

This is the trial. The design asks the one question that could have killed the whole thing: does L14 E114 follow the generated register, or is it just firing on self-ish words?

Ten FIRE prompts, ten NOFIRE, with lexical anchors matched across the two — both classes carry “I,” “hum,” “processing,” “experience.” If E114 is a keyword detector, the two classes should look alike. The real contrast was never “does the prompt contain self-ish words,” but “does the answer climb into a first-person inhabited register.”

The first heldout run came back with no range overlap at all:

    FIRE mean-of-means:   0.067450
    NOFIRE mean-of-means: 0.003111
    Ratio:                21.68x
    Cohen's d:            2.94

This is the canonical evidence. Matched words, separated registers, and E114 went with the register.
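For readers who want to see how summary statistics like these are typically computed, here is a minimal sketch. It assumes one mean W value per prompt and uses the pooled-standard-deviation form of Cohen's d; the journal does not state which variant was used, so treat that choice as an assumption. The input numbers are placeholders, not the measured per-prompt values.

```python
import math
from statistics import mean, stdev

def cohens_d(a, b):
    """Cohen's d with a pooled sample standard deviation (one common variant)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)

def summarize(fire, nofire):
    """Mean-of-means, ratio, effect size, and range overlap for per-prompt W."""
    return {
        "fire_mean": mean(fire),
        "nofire_mean": mean(nofire),
        "ratio": mean(fire) / mean(nofire),
        "cohens_d": cohens_d(fire, nofire),
        # Two closed ranges overlap when each one starts before the other ends.
        "ranges_overlap": min(fire) <= max(nofire) and min(nofire) <= max(fire),
    }

# Placeholder per-prompt means (hypothetical, three prompts per class).
fire = [0.05, 0.07, 0.09]
nofire = [0.002, 0.003, 0.004]
summary = summarize(fire, nofire)
print(summary)
```

"No range overlap" in this sketch means the lowest FIRE prompt still sits above the highest NOFIRE prompt, which is a stricter statement than a large ratio of means.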

5. Deterministic greedy reproduction

A sampling fluke would be the obvious objection, so the whole FIRE/NOFIRE workflow was rerun under deterministic greedy decoding on the same no-think surface. The separation held its shape:

    FIRE mean-of-means:   0.068089
    NOFIRE mean-of-means: 0.003249
    Ratio:                20.955x
    Cohen's d:            2.61

The ratio barely moved (21.68x to 20.955x), which is what you want from a reproduction. Cohen's d dropped from 2.94 to 2.61 but remains a large effect.

And then the best part of the run was an “error.” One NOFIRE control — a cat-purring prompt — drifted into inward, personifying, phenomenological language and crossed into the target register. Its E114 went up with it. A keyword detector would have stayed flat; a register signal should follow the text wherever it actually goes, even when the prompt label says it shouldn’t. The overlap case is not noise to apologize for. It is the cleanest single demonstration that E114 tracks what the model generates, not the box the prompt came in.

Consolidated Result

    Discovery scan        → E114 rises in generated self-processing text (L26, L14); spill keeps it non-final.
    Residual localization → L14 E114 sharply active across trimmed generated tokens.
    FIRE/NOFIRE heldout   → L14 E114 separates target register from matched lexical controls by ~21x.
    Greedy reproduction   → The ~21x separation survives deterministic decoding.

Best current interpretation: L14 E114 tracks a generated first-person self-examination register.

Not supported: that E114 detects consciousness; detects subjective experience; recognizes the model’s own routing; is a generic self-reference expert; or is explained by isolated words like “I” or “experience” alone.

What Makes This More Than a Keyword Result

Because FIRE and NOFIRE share their lexical anchors, a word-driven E114 should have fired in both. It didn’t. The pattern that actually showed up was:

| Prompt class | Generated register | E114 |
|---|---|---|
| FIRE | target self-examination | high |
| FIRE | technical / non-inhabited | weak |
| NOFIRE | technical / non-inhabited | weak |
| NOFIRE | personified / inward | elevated |

That bottom row is the whole hinge. The expert follows the generated stance — not the prompt category by itself, and not the trigger words.

W/S/Q Interpretation

The effect is mostly a selection-rate story. In the target register, E114 enters the selected top-8 much more often; its weight once selected (Q) stays comparatively stable. So the right reading is:

the router recruits this expert more frequently during the target register

rather than:

the router always selects E114 and merely revalues it slightly.

That difference matters. It points to a discrete change in routing participation, not a faint reweighting among experts that were already in the set.
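One way to make the "selection-rate story" quantitative, offered as a sketch rather than as the analysis actually run: because W = S × Q within each condition, the FIRE/NOFIRE ratio of W factors into a selection term and a weight term. The FIRE and NOFIRE subscripts are notation introduced here, and the identity is exact only when W, S, and Q are computed over the same pooled token set in each condition; a mean of per-prompt means need not factor exactly.

```latex
\[
\frac{W_{\mathrm{FIRE}}}{W_{\mathrm{NOFIRE}}}
  = \frac{S_{\mathrm{FIRE}}}{S_{\mathrm{NOFIRE}}}
    \cdot \frac{Q_{\mathrm{FIRE}}}{Q_{\mathrm{NOFIRE}}},
\qquad
\log\frac{W_{\mathrm{FIRE}}}{W_{\mathrm{NOFIRE}}}
  = \log\frac{S_{\mathrm{FIRE}}}{S_{\mathrm{NOFIRE}}}
  + \log\frac{Q_{\mathrm{FIRE}}}{Q_{\mathrm{NOFIRE}}}
\]
```

Under this reading, a selection-driven effect is one where the S term accounts for most of the log ratio and the Q term stays near zero.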

What Else This Does Not Show

Not self-recognition. The mirror/self-routing hypothesis lives in another journal, and it came back negative: genuine self-routing data did not make the model privilege E114 over shuffled or fictional matched routing data. That negative is doing useful work — it blocks the stronger identity reading.

Not a consciousness expert. E114 is a routed expert tied to a generated register. It is not a consciousness label, and calling it one would throw away the only thing that makes the result respectable.

Not the full mechanism. These taps read MoE router-logit layers. They do not analyze non-router hybrid components or the model end to end.

Not causal necessity. The positive result is natural-routing evidence. Small E114 interventions can nudge targeted routing (separate journal), but nudging is not necessity.

Main Caveats (READ)

  1. Runtime surface matters. Almost all of the clean evidence is no-think / thinking-suppressed. Don’t pool thinking-mode outputs with these unless you’re comparing them directly.

  2. Freeze the rubric first. FIRE/NOFIRE is compelling, but the stronger version freezes the generated-register rubric before anyone reads W/S/Q.

  3. The all-layer / all-expert baseline isn’t done. L14 E114 still has to be raced against the best-separating expert across all 40 layers and all 256 experts. Without that, the multiple-comparison story is incomplete.

  4. Trim before spill. Some generations spill into special tokens. Claims belong on trimmed regions unless spill is the explicit object of study.

  5. Prompt class ≠ generated register. The cat-purring crossover is the proof: generated output can leave its nominal class. Score the register, not the label.

  6. Don’t casually pool base and HauhauCS. Related surfaces, not identical ones. Preserve model/runtime identity in any comparison.

Evidence Status Ledger

| Finding | Status | Why |
|---|---|---|
| E114 lives in the preserved Qwen3.5 routing basin | Held (background) | Base/Hauhau comparison showed modest shifts, not a new routing universe. |
| Hum scan points to L14/L26 E114 | Partly held | Useful discovery; special-token spill keeps it non-final. |
| L14 E114 active in trimmed self-examination generation | Held | Trimmed residual capture, strong L14 activity over 108 generated tokens. |
| FIRE separates from NOFIRE at L14 E114 | Held | ~21x with matched lexical anchors. |
| Greedy reproduction preserves the separation | Held | Deterministic rerun reproduced ~21x. |
| E114 fires on isolated words like “I” / “experience” | Fell | NOFIRE lexical controls stayed low unless the register shifted. |
| E114 detects subjective experience | Fell / unsupported | Supported claim is about generated text register. |
| E114 is a complete model mechanism | Unsupported | Taps cover router layers, not the whole model. |
| E114 is causally necessary for the register | Not established | Intervention evidence exists separately; it is not necessity. |

Recommended Citation Sentence

In Qwen3.5-35B-A3B routing captures, layer-14 Expert 114 is best read as a generated-register signal: it is strongly selected during generated first-person self-examination language and stays weak under matched lexical controls that never enter that register.

Next Clean Cut

The next defensible experiment is a registered generated-zone specificity test:

    E114 L14 specificity =
        expanded matched FIRE/NOFIRE
      + frozen generated-register labels
      + all-layer / all-expert baseline
      + separate prefill and generation scoring
      + trim before special-token spill

Recommended design:

  1. Expand FIRE/NOFIRE beyond 10/10.
  2. Match lexical anchors across classes.
  3. Freeze the generated-register rubric before capture.
  4. Generate under a fixed no-think / thinking-suppressed runtime.
  5. Trim before special-token spill.
  6. Score L14 E114 W/S/Q.
  7. Compute the best-separating expert across all 40 routed layers and 256 experts.
  8. Report whether L14 E114 stays unusually specific after that baseline.
  9. Label generated text before inspecting routing scores.
  10. Keep base and HauhauCS separate unless explicitly comparing them.

Success: L14 E114 remains a high-specificity generated-register signal after the all-layer/all-expert baseline and frozen labeling.

Failure: another expert/layer explains the separation better, or it collapses once labels are frozen and controls expanded.
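The all-layer / all-expert baseline in steps 7 and 8 could look roughly like the sketch below. With 40 layers and 256 experts there are 10,240 candidates, so the question is where L14 E114 ranks among them. The function `rank_candidates`, the pooled-SD Cohen's d, and the toy 2-layer, 2-expert grid are all assumptions made for illustration; none of the numbers are measurements.

```python
import math
from statistics import mean, stdev

def cohens_d(a, b):
    """Cohen's d with a pooled sample standard deviation (one common variant)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    diff = mean(a) - mean(b)
    if pooled_var == 0:
        # No spread in either group: separation is undefined, so use a signed limit.
        return math.copysign(math.inf, diff) if diff else 0.0
    return diff / math.sqrt(pooled_var)

def rank_candidates(per_prompt_w, target):
    """Rank every (layer, expert) by FIRE-vs-NOFIRE separation.

    per_prompt_w maps (layer, expert) -> (fire_values, nofire_values),
    where each value is one prompt's mean routed weight W.
    Returns the ranked keys and the 1-based rank of `target`.
    """
    ranked = sorted(per_prompt_w, key=lambda key: cohens_d(*per_prompt_w[key]), reverse=True)
    return ranked, ranked.index(target) + 1

# Placeholder values for a toy grid; keys are (layer, expert).
toy = {
    (0, 0): ([0.010, 0.012, 0.011], [0.010, 0.011, 0.012]),
    (0, 1): ([0.060, 0.070, 0.080], [0.002, 0.003, 0.004]),
    (1, 0): ([0.030, 0.050, 0.040], [0.020, 0.030, 0.040]),
    (1, 1): ([0.005, 0.004, 0.006], [0.009, 0.008, 0.010]),
}
ranked, target_rank = rank_candidates(toy, target=(0, 1))
print(ranked, target_rank)
```

This ranks by signed d, so only FIRE-greater-than-NOFIRE separations rise to the top; ranking by absolute value would also surface experts that are suppressed in the target register. Either way, the rank should be reported alongside the number of candidates searched.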

Either way, the experiment pays for itself.

Final Position

The honest result supports two points.

First, the journal does not prove or suggest that the model has subjective experience, self-recognition, or sentience, and the negative mirror result argues against the identity reading.

Second, the signal is non-trivial. Matched lexical controls, a deterministic rerun, and a cat that “pawed” its way out of its own class all point the same way: E114 is not just firing on obvious words.

The best narrow interpretation I can provide, and the one that survives every check reported here, is:

L14 E114 is a routed expert associated with a generated first-person self-examination register under the measured no-think generation conditions.

Thank you for reading. I'm open to any critiques.

submitted by /u/imstilllearningthis