r/LocalLLaMA · · 6 min read

Update: First Manual Results from Testing Procedural Skill Transfer in Small Models

Yesterday I posted an idea for testing whether a large model can transfer some of its procedural skill to a smaller model without fine-tuning.

The short version of the idea was this:

Small models are often not completely lacking knowledge. They know the syntax. They know the libraries. They usually understand the task at a basic level. The problem is that their outputs are shallow. They skip planning, hierarchy, decomposition, visual structure, and the kind of step-by-step discipline that bigger models seem to apply more naturally.

So I wanted a test where this difference would be visible.

That is why I used Three.js.

With normal code tasks, a model can sometimes hide weakness behind verbose explanations or familiar patterns. With Three.js, the render exposes the actual structure. If the model does not plan the geometry, camera, lighting, proportions, hierarchy, and composition, the output looks bad immediately.

The experiment was based on two domains.

The first domain was a complex character scene: a Thriller-style choreography scene with multiple recognizable characters, animation, lighting, stage composition, and cinematic presentation.

The second domain was completely different: a low-poly BMPT-72 turret with a recognizable silhouette.

Both use Three.js, but they are not the same kind of task. One is about characters, posing, choreography, environment, and staging. The other is about mechanical shape, turret structure, weapons, silhouette, and object proportions.

The idea was not to transfer the scene itself. The idea was to transfer the process.

The simplified protocol is:

A = larger model
B = smaller model
P1 = source prompt
P2 = target prompt
S = procedural scaffold

First:

A + P1 -> D1A
A + P2 -> D2A
B + P1 -> D1B
B + P2 -> D2B

Then the larger model creates a scaffold from the weaknesses the smaller model showed in the first domain:

A + P1 + code/render of D1B -> S 

The important rule is that the model creating S does not see P2, does not see D2A, and does not know what the target-domain test will be.

Then the smaller model is run again:

B + S + P1 -> D1B_S
B + S + P2 -> D2B_S

The real question is whether:

D2B_S is closer to D2A than D2B was 

In other words, did the scaffold improve the smaller model on a different task, without showing it the answer?
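The protocol above can be sketched as a small driver. This is only a minimal sketch, not the harness used for the manual test: a "model" is assumed to be any callable that maps a prompt string to an output string, and `run_protocol`, `build_scaffold_request`, and the wording of the scaffold request are hypothetical names and text invented for illustration. The one property it tries to make explicit is that the request that produces S is built from P1 and D1B only.

```python
from typing import Callable, Dict

# Assumption: a "model" is any callable mapping a prompt string to an output string.
Model = Callable[[str], str]


def build_scaffold_request(p1: str, d1b: str) -> str:
    """Prompt sent to A to produce S. It takes P1 and B's output on P1, never P2."""
    # Hypothetical wording; the post does not show the real request.
    return (
        "Here is a task and a weaker model's attempt at it.\n"
        f"TASK:\n{p1}\n\nATTEMPT:\n{d1b}\n\n"
        "Write a task-agnostic procedure that would fix the weaknesses you see."
    )


def run_protocol(model_a: Model, model_b: Model, p1: str, p2: str) -> Dict[str, str]:
    # Baselines: both models on both domains.
    out = {
        "D1A": model_a(p1),
        "D2A": model_a(p2),
        "D1B": model_b(p1),
        "D2B": model_b(p2),
    }
    # Scaffold: A sees only P1 and D1B.
    out["S"] = model_a(build_scaffold_request(p1, out["D1B"]))
    # Scaffolded runs: B gets S prepended to each prompt.
    out["D1B_S"] = model_b(out["S"] + "\n\n" + p1)
    out["D2B_S"] = model_b(out["S"] + "\n\n" + p2)
    return out
```

Because `build_scaffold_request` has no P2 parameter at all, the isolation rule is enforced by the function signature rather than by convention. In a real run the two callables would wrap whatever inference API or local runtime is in use.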

I ran a first manual test and put the outputs in a video.

This is not a formal benchmark yet. It is just a first sanity check to see if the effect is real enough to automate later.

The result was actually pretty clear.

On DeepSeek V4 Pro, which is already a much stronger model, the scaffold did help, but mostly as polish. It improved lighting, presentation, scene decoration, and the overall art direction. But the baseline was already structurally decent, so the difference was not huge.

That part makes sense to me. A larger model already has more internal planning depth. The scaffold does not give it a new brain. It mostly pushes it to be more explicit and consistent.

The much bigger difference appeared on Qwen 27B and also on the 35B A3 model quantized to Q3_K_M.

Without the scaffold, the Qwen outputs often had the usual smaller-model failure mode: objects thrown into a dark scene, weak environment, poor contrast, shallow hierarchy, and primitive shapes that technically satisfy part of the prompt but do not really form a readable scene.

With the scaffold, the same model started behaving differently.

In the Thriller scene, it produced a more readable stage, separated characters better, added environmental structure, used stronger lighting, and gave the scene more depth. It still was not perfect, but it stopped looking like disconnected primitives in a dark void.

In the turret task, the improvement was also visible. The baseline was closer to a generic dark blocky vehicle. The scaffolded version had a clearer body, better turret structure, more deliberate weapon placement, side details, sensor-like elements, and a more readable silhouette.

The 35B Q3_K_M result was also interesting. Even with heavy quantization, the scaffold seemed to help it hold the structure together. It did not become a frontier model, but it followed the construction process better than the baseline.

The part that matters most to me is that the scaffold did not simply copy the first domain.

It did not put Thriller details into the turret. It did not add human limbs to it. It did not confuse the character scene with the mechanical object.

What transferred was more abstract:

plan before coding
define the scene contract
build in layers
separate subject, environment, lighting, and camera
preserve silhouette
add identity cues
avoid plain primitive-only objects
audit the final output

That is exactly the kind of thing I was trying to test.

My current interpretation is that this works less like a normal “better prompt” and more like an external planning scaffold. Smaller models often know enough to do parts of the task, but they do not maintain the full structure across a long generation. The scaffold gives them a temporary planning discipline inside the context.
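One way to picture "planning discipline inside the context" is a scaffold that is just an ordered list of steps placed in front of an unchanged task prompt. The sketch below assumes that shape. The `Scaffold` class, `scaffolded_prompt`, and the framing sentences are hypothetical, and the actual scaffold text from the test is not reproduced here. Only the step names are taken from the list of what transferred.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Scaffold:
    """Hypothetical container for a procedural scaffold: ordered, task-agnostic steps."""
    steps: List[str] = field(default_factory=list)

    def render(self) -> str:
        numbered = [f"{i}. {step}" for i, step in enumerate(self.steps, start=1)]
        return "Follow this procedure before and while writing code:\n" + "\n".join(numbered)


def scaffolded_prompt(scaffold: Scaffold, task_prompt: str) -> str:
    """Prepend the scaffold to the task prompt; the task text itself is unchanged."""
    return scaffold.render() + "\n\nTask:\n" + task_prompt


# Step names copied from the post's list of what transferred.
TRANSFERRED = Scaffold(steps=[
    "Plan before coding",
    "Define the scene contract",
    "Build in layers",
    "Separate subject, environment, lighting, and camera",
    "Preserve silhouette",
    "Add identity cues",
    "Avoid plain primitive-only objects",
    "Audit the final output",
])

print(scaffolded_prompt(TRANSFERRED, "Build a low-poly BMPT-72 turret in Three.js."))
```

Nothing in the steps mentions characters, dancing, or turrets, which is what lets the same prefix be reused for P1 and P2.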

The effect also seems asymmetric.

The bigger model improved a bit, mostly in polish.

The smaller models improved much more, especially in structure and readability.

That fits the original hypothesis: smaller models may have the knowledge, but not enough procedural control to organize it reliably.

Again, this is not proof yet.

The next step is to turn this into a proper blind test:

D2A = large model target-domain output
D2B = small model baseline target-domain output
D2B_S = small model target-domain output with scaffold

Then a separate blind evaluator should compare only the rendered images, without knowing which model produced which output and without seeing the code.
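The blinding step could look roughly like this. It is a sketch under assumptions: `blind` is a hypothetical helper, renders are assumed to be stored as image files, and the file names themselves are assumed to be neutral already (a path like `d2b_s.png` would leak the condition to the evaluator).

```python
import random
from typing import Dict, Tuple


def blind(renders: Dict[str, str], seed: int) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Replace condition names with neutral labels in shuffled order.

    renders maps a condition name (e.g. "D2B") to an image path.
    Returns (what the evaluator sees, the answer key kept by the experimenter).
    """
    conditions = list(renders)
    # A seeded shuffle keeps the assignment reproducible for later auditing.
    random.Random(seed).shuffle(conditions)
    labels = [f"image_{i}" for i in range(1, len(conditions) + 1)]
    shown = {label: renders[cond] for label, cond in zip(labels, conditions)}
    key = dict(zip(labels, conditions))
    return shown, key
```

The evaluator receives only `shown`. The `key` stays with whoever runs the experiment and is used to map ratings back to conditions after scoring is finished. Using a different seed per prompt avoids the scaffolded render always landing in the same slot.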

The key metric, with Score(X, Y) read as how close the evaluator judges render Y to be to render X, would be:

Score(D2A, D2B_S) > Score(D2A, D2B) 

If that holds across many prompts, then the scaffold is not just improving one example. It is transferring a reusable procedure.
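Checking whether the inequality "holds across many prompts" can be done with a win rate plus a simple significance check. The sketch below is one possible approach, not something the post specifies: it assumes each prompt yields a pair of numeric scores from the blind evaluator, and it uses a one-sided exact sign test, which is my choice of test here. The function names are hypothetical.

```python
from math import comb
from typing import List, Tuple

# Each pair is (Score(D2A, D2B_S), Score(D2A, D2B)) for one prompt.
ScorePairs = List[Tuple[float, float]]


def win_rate(pairs: ScorePairs) -> float:
    """Fraction of non-tied prompts where the scaffolded output scored higher."""
    decided = [(s, b) for s, b in pairs if s != b]
    if not decided:
        return 0.5  # no evidence either way
    wins = sum(1 for s, b in decided if s > b)
    return wins / len(decided)


def sign_test_p(pairs: ScorePairs) -> float:
    """One-sided exact sign test.

    Probability of seeing at least this many wins if the scaffold
    made no difference (each non-tied prompt is a fair coin flip).
    """
    decided = [(s, b) for s, b in pairs if s != b]
    n = len(decided)
    wins = sum(1 for s, b in decided if s > b)
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n
```

The sign test only looks at which side won on each prompt, so it does not depend on the evaluator's scale being well calibrated. A single prompt pair, like the one in this manual run, cannot give a meaningful p-value, which is why the number of prompts matters more than the size of any one gap.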

For now, I would only call this a preliminary manual result. But after watching the outputs side by side, I think the idea is worth testing more seriously.

The main takeaway so far:

A scaffold derived from one Three.js domain seems to help smaller models produce better structure in another Three.js domain, without fine-tuning and without seeing the target-domain answer.

That does not mean the small model becomes as good as the large model.

It means the large model may be able to externalize part of its planning discipline into a reusable inference-time structure.

That is the part I want to test properly next.

submitted by /u/ConfidentDinner6648