r/LocalLLaMA · June 17, 2026 · 2 min read

Multilingual-Multimodal-NLP/LoopCoder-V2 · Hugging Face

#multimodal #paper

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.


GitHub: https://github.com/CSJianYang/LoopCoder
arXiv: https://arxiv.org/abs/2606.18023
Full Paper PDF: https://arxiv.org/pdf/2606.18023

LoopCoder-V2

LoopCoder-v2 is a 7B instruction-tuned code model based on the Parallel Loop Transformer (PLT). The model studies test-time computation scaling through repeated application of shared Transformer blocks while keeping the parameter count fixed.

The released checkpoint is the two-loop PLT variant (plt_num_loops=2). In the accompanying paper, this setting gives the best gain-cost trade-off: the second loop provides most of the useful latent refinement, while additional loops show diminishing or unstable updates.
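To make "repeated application of shared Transformer blocks" concrete, here is a minimal toy sketch in plain Python. It is not the LoopCoder implementation: shared_block, looped_forward, and param_count are hypothetical names, the block is a tiny residual map rather than a real Transformer layer, and the loops run sequentially (it does not model PLT's parallel execution). It only illustrates that raising the loop count adds compute while the parameter count stays the same.

```python
import math

def shared_block(h, weight):
    """Toy residual block: h + tanh(W h). `weight` is the only parameter set."""
    out = []
    for i, row in enumerate(weight):
        z = sum(w * x for w, x in zip(row, h))
        out.append(h[i] + math.tanh(z))
    return out

def looped_forward(h, weight, num_loops):
    """Apply the same block `num_loops` times, reusing the same weights each pass."""
    for _ in range(num_loops):
        h = shared_block(h, weight)
    return h

def param_count(weight):
    """Parameters are counted once, no matter how many loops reuse them."""
    return sum(len(row) for row in weight)

# Hypothetical 2x2 weight matrix and hidden state, for illustration only.
W = [[0.5, -0.25], [0.1, 0.3]]
h0 = [1.0, 2.0]

one_pass = looped_forward(h0, W, num_loops=1)
two_pass = looped_forward(h0, W, num_loops=2)  # analogous in spirit to plt_num_loops=2
```

Here two_pass equals the block applied to one_pass, and param_count(W) is 4 in both cases: the second loop is extra computation over the same weights.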

Highlights

7B dense PLT coder trained from scratch on 18T tokens of mixed text and code data.
Instruction-tuned with a matched supervised fine-tuning recipe.
Uses cross-loop position offsets and shared-KV gated sliding-window attention.
Targets code generation, multilingual code, code reasoning, agentic software engineering, and tool-use workflows.
Strongest loop-count setting in the paper: two loops, not more.

LoopCoder-v2: Only Loop Once for Efficient Test-Time Computation Scaling

TL;DR. For Parallel Loop Transformers (PLT), more looping is not better. A 7B coder that loops just once more than usual (two passes total) lifts SWE-bench Verified from 43.0 to 64.4, while three or more loops regress. We explain this with a gain–cost view of looping and provide diagnostics for picking the loop count without brute-force sweeps.

Overview

Looped Transformers scale latent computation by repeatedly applying a shared block, but sequential looping increases latency and KV-cache memory with the loop count. Parallel Loop Transformers (PLT) alleviate this with two mechanisms:

CLP — cross-loop position offsets, which break sequential inter-loop dependencies and enable parallel loop execution.
G-SWA — shared-KV gated sliding-window attention, which keeps the cache footprint nearly constant across loop counts.
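A back-of-the-envelope calculation shows why the cache footprint matters. The sketch below uses the standard KV-cache size formula (keys plus values, per layer, per KV head, per token). Everything else is an assumption for illustration: the model dimensions in CFG, the window size, and the accounting in shared_kv_window_cache (one full-length cache shared by all loops plus a window-sized cache per extra loop) are hypothetical and are not taken from the paper or the released checkpoint.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to store keys and values for one pass over `seq_len` tokens."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

def sequential_loop_cache(seq_len, loops, **cfg):
    """Naive looping: every loop keeps its own full-length cache."""
    return loops * kv_cache_bytes(seq_len, **cfg)

def shared_kv_window_cache(seq_len, loops, window, **cfg):
    """Assumed accounting: one shared full cache, plus a window-sized cache per extra loop."""
    full = kv_cache_bytes(seq_len, **cfg)
    extra = (loops - 1) * kv_cache_bytes(min(window, seq_len), **cfg)
    return full + extra

# Hypothetical dimensions, NOT the published LoopCoder-v2 configuration.
CFG = dict(n_layers=32, n_kv_heads=8, head_dim=128)
SEQ_LEN = 32768
WINDOW = 64

naive = sequential_loop_cache(SEQ_LEN, loops=2, **CFG)
shared = shared_kv_window_cache(SEQ_LEN, loops=2, window=WINDOW, **CFG)
print(f"naive two-loop cache:  {naive / 2**30:.3f} GiB")
print(f"shared + window cache: {shared / 2**30:.3f} GiB")
```

Under these made-up numbers, naive two-loop caching doubles the footprint from 4 GiB to 8 GiB, while the shared-plus-window accounting adds only 8 MiB for the second loop, which is the sense in which the footprint stays nearly constant across loop counts.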

Once cost is flattened, loop count becomes a free design knob — and the question becomes: how many loops are actually worth it? We study this through a gain–cost lens: an extra loop may refine representations (gain), but CLP also introduces a roughly constant positional mismatch at each loop boundary (cost), which we quantify with an intrinsic offset cost Ω(r).
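The gain–cost argument can be illustrated with a toy stopping rule. This is not the paper's diagnostic and does not compute Ω(r). The function best_loop_count, the marginal gains, and the constant cost are all hypothetical, made-up values chosen only to show how diminishing gains plus a roughly constant per-boundary cost produce a finite best loop count.

```python
def best_loop_count(marginal_gains, offset_cost):
    """Return the loop count after which adding another loop stops paying off.

    marginal_gains[i] is the assumed gain from adding loop i + 2
    (loop 1 is the baseline single pass). offset_cost is the assumed
    constant cost paid at each loop boundary.
    """
    loops = 1
    for gain in marginal_gains:
        if gain <= offset_cost:
            break  # the next loop costs at least as much as it gains
        loops += 1
    return loops

# Made-up numbers for illustration; they are not measurements from the paper.
toy_gains = [1.0, 0.2, 0.05]  # gain from the 2nd, 3rd, and 4th loop
toy_cost = 0.3

print(best_loop_count(toy_gains, toy_cost))  # 2
```

With these toy values the second loop clears the cost and the third does not, so the rule stops at two loops. A different gain curve or cost would move the answer, which is why the loop count has to be measured rather than assumed.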

We instantiate the study with LoopCoder-v2, a family of 7B PLT coders trained from scratch on 18T tokens of mixed text and code (1:1, 100+ programming languages), under matched training, instruction tuning, and evaluation.

submitted by /u/pmttyji