r/LocalLLaMA · June 28, 2026 · 3 min read

A Blind Visual Paradigm for Testing Skill Transfer in Small Models Without Fine-Tuning

TL;DR: Small models aren't dumb, they're shallow. I designed a cross-domain, blind, visual experiment to see whether a large model can compress its "planning discipline" into a reusable scaffold that makes a small model deeper, with zero fine-tuning. Three.js is the testbed because you can't fake structure with verbose text; the render exposes everything.

I’ve been spending a lot of time testing smaller models (around 9B parameters), and I’ve noticed something: they aren’t exactly dumb; they're just shallow. They understand the task, but their outputs lack planning depth, hierarchy, and procedural discipline. They skip the structural steps that larger models apply naturally.

This got me thinking: can a large model (Model A) compress its procedural ability into a reusable structure that makes a smaller model (Model B) perform deeper, without any fine-tuning? And more importantly, can we prove this transfer of skill is real and not just overfitting?

I came up with an experimental paradigm to test this using Three.js. I chose Three.js because it’s easy to verify visually, but hard to generate correctly. A model can't just output verbose text to hide its lack of understanding; the rendered image exposes its true procedural depth.

Here is the baseline of the experiment. Look at these 4 images:

Image 1 (D1A): Model A (Large) output for a complex cinematic scene (Michael Jackson, Pepe, Trump, and Elon Musk performing "Thriller").

Image 2 (D1B): Model B (Small) output for the exact same prompt. Notice how it gets the concept, but the result is visually shallow, structurally weak, and lacks hierarchy.

Image 3 (D2A): Model A output for a completely different, semantically distinct domain: "Make a BMPT-72 turret in Three.js - low poly with recognizable silhouette."

Image 4 (D2B): Model B baseline output for the turret. Again, shallow.

The Theory:
My hypothesis is that Model A can look at the gap between D1A and D1B and extract a general "Procedural Scaffold" (S).

S is a set of instructions, decomposition steps, or harness logic (e.g., plan -> geometry -> silhouette check -> detailer -> renderer -> critic).
Crucial rule: S cannot contain the answer to D1. It must only extract the deeper construction principles.
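To make the scaffold idea concrete, here is a minimal sketch of how S could be stored and checked. The stage names come from the pipeline above; the instruction wording, the `Stage` class, and the `find_leaks` / `build_prompt` helpers are all hypothetical, written only for illustration. The leak check is a crude substring match, so it would catch a literal "Thriller" in S but not a paraphrased leak.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    instruction: str


# Hypothetical scaffold S. Stage names follow the post; the wording is illustrative.
SCAFFOLD = [
    Stage("plan", "List the major parts and their parent/child hierarchy before writing code."),
    Stage("geometry", "Build each part from primitives, grouped to match the plan."),
    Stage("silhouette check", "Describe the outline from the front and side; fix anything unrecognizable."),
    Stage("detailer", "Add secondary details only after the silhouette reads correctly."),
    Stage("renderer", "Set up camera, lights, and materials so the whole subject is in frame."),
    Stage("critic", "Review the result against the plan and revise the weakest part."),
]

# Terms that would give away the source task D1 (assumed list, extend as needed).
D1_TERMS = ["Thriller", "Michael Jackson", "Pepe", "Trump", "Elon Musk"]


def find_leaks(scaffold, source_terms):
    """Return the source-domain terms that appear in the scaffold text (case-insensitive)."""
    text = " ".join(stage.instruction for stage in scaffold).lower()
    return sorted(term for term in source_terms if term.lower() in text)


def build_prompt(scaffold, task):
    """Prepend the numbered scaffold steps to a new task for Model B."""
    steps = "\n".join(
        f"{i}. [{stage.name}] {stage.instruction}" for i, stage in enumerate(scaffold, 1)
    )
    return f"Follow these steps in order:\n{steps}\n\nTask: {task}"
```

If `find_leaks(SCAFFOLD, D1_TERMS)` returns anything, the scaffold is carrying part of the D1 answer and should be rewritten before it is applied to D2.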

The Real Test (What I haven't run yet):
To prove S is transferable, we apply Scaffold S to Model B and ask it to generate the BMPT-72 turret again (D2B_S).

The Blind Validation:
This is the catch. To prove the improvement is real, we use a fresh instance of Model A (Model C) as a blind judge. Model C has zero context about the experiment, the scaffold, or the prompts. It receives only the rendered images of D2A, D2B, and D2B_S. Model C is asked to score the images quantitatively (0-10) on visual quality, recognizable silhouette, structural coherence, and detail density.
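A rough sketch of the blinding step, assuming the three renders are saved as image files. The function names, the `image_N` labels, and the file paths are hypothetical; the four criteria are the ones listed above. The idea is that the judge only ever sees neutral labels in a shuffled order, and the mapping back to D2A / D2B / D2B_S stays with the experimenter. Averaging the four criteria into one number is my simplification, not something the setup requires.

```python
import random

CRITERIA = (
    "visual quality",
    "recognizable silhouette",
    "structural coherence",
    "detail density",
)


def blind_batch(renders, seed):
    """Shuffle conditions and hide them behind neutral labels.

    renders: dict mapping condition name -> image path.
    Returns (items for the judge, private key mapping label -> condition).
    The judge should receive the image contents only, never the path.
    """
    rng = random.Random(seed)
    conditions = list(renders)
    rng.shuffle(conditions)
    key = {f"image_{i}": cond for i, cond in enumerate(conditions, 1)}
    items = [{"label": label, "path": renders[cond]} for label, cond in key.items()]
    return items, key


def unblind(scores_by_label, key):
    """Map the judge's 0-10 scores back to conditions and average the criteria."""
    result = {}
    for label, scores in scores_by_label.items():
        missing = set(CRITERIA) - set(scores)
        if missing:
            raise ValueError(f"{label} is missing scores for {sorted(missing)}")
        if any(not 0 <= scores[c] <= 10 for c in CRITERIA):
            raise ValueError(f"{label} has a score outside 0-10")
        result[key[label]] = sum(scores[c] for c in CRITERIA) / len(CRITERIA)
    return result
```

Using a fixed seed keeps the shuffle reproducible, so the same key can be rebuilt later when the scores come back.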

The Conclusion:
If the instruction S, extracted from the Thriller scene (D1), increases the quality of Model B's output in the turret domain (D2), which is completely different from D1, then S is not just overfitted to the source example.

If Score(D2A, D2B_S) > Score(D2A, D2B), meaning the scaffolded small model gets visually closer to the large model's baseline without ever seeing the answer, then S contains transferable procedural knowledge, at least within the platform tested here (Three.js).
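One possible reading of that comparison, as a sketch: treat each condition's judged score as a single number and ask what fraction of the gap between Model A and baseline Model B the scaffold closes. The `gap_closed` and `transfer_supported` names, the `min_closed` threshold, and the idea of averaging over several repeated runs are my assumptions; the setup as described only calls for a single comparison.

```python
def gap_closed(score_a, score_b, score_b_s):
    """Fraction of the A-to-B gap that the scaffolded output closes.

    0.0 means no improvement, 1.0 means Model B with S matches Model A,
    and a negative value means the scaffold made things worse.
    """
    gap = score_a - score_b
    if gap <= 0:
        raise ValueError("baseline Model B already matches or beats Model A; no gap to close")
    return (score_b_s - score_b) / gap


def transfer_supported(trials, min_closed=0.0):
    """trials: list of (score_a, score_b, score_b_s) triples, one per run.

    Returns (verdict, mean fraction of the gap closed).
    """
    fractions = [gap_closed(a, b, b_s) for a, b, b_s in trials]
    mean = sum(fractions) / len(fractions)
    return mean > min_closed, mean
```

For example, `gap_closed(8.0, 4.0, 6.0)` is `0.5`: the scaffolded output recovered half of the distance to the large model's baseline.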

I genuinely think this visual, blind, cross-domain setup could be a great paradigm to prove post-training skill generalization. Does this make sense? Where do you think the setup might fail?

submitted by /u/ConfidentDinner6648