Hugging Face Daily Papers · 5 min read

Do not copy and paste! Rewriting strategies for code retrieval

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2605.08299


Published on May 8 · Submitted by Andrea Gurioli on May 13
Authors: Andrea Gurioli, Federico Pennino, Maurizio Gabbrielli
AI-generated summary

Research investigates how different text rewriting strategies impact code retrieval performance, identifying that full natural language rewriting provides the greatest improvements while proposing entropy-based diagnostics to determine when such costly rewrites are beneficial.

Abstract

Embedding-based code retrieval often suffers when encoders overfit to surface syntax. Prior work mitigates this by using LLMs to rephrase queries and corpora into a normalized style, but leaves two questions open: how much representational shift helps, and when is the per-query LLM call justified? We study a hierarchy of three rewriting strategies: stylistic rephrasing, NL-enriched PseudoCode, and full Natural-Language transcription, under joint query-corpus (QC, online) and corpus-only (C, offline) augmentation, across six CoIR benchmarks, five encoders, and three rewriters spanning independent model families (Qwen, DeepSeek, Mistral). We are the first to evaluate NL-enriched PseudoCode and snippet-level Natural Language as direct retrieval representations, rather than as transient intermediates. Full NL rewriting with QC yields the largest gains (+0.51 absolute NDCG@10 on CT-Contest for MoSE-18), while corpus-only rewriting degrades retrieval in 56 of 90 configurations (about 62%). We introduce two diagnostics, ΔH (token entropy) and Δs (embedding cosine), and show that ΔH predicts retrieval gain under QC across all three rewriter families: pooled Spearman ρ = +0.436, p < 0.001 on DeepSeek+Codestral; ρ = +0.593 on Codestral alone; ρ = +0.356 on Qwen. This establishes ΔH as a cheap, rewriter-agnostic proxy for deciding when rewriting pays off before running retrieval. Our analysis reframes LLM rewriting as a cost-benefit decision: it is most effective as a remediation layer for lightweight encoders on code-dominant queries, with diminishing returns for strong encoders or NL-heavy queries.
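The two diagnostics are simple enough to sketch. Below is a minimal, illustrative implementation of ΔH (the shift in token-level Shannon entropy from original to rewritten text) and Δs (the embedding cosine between the two versions); the whitespace tokenizer, the entropy estimator, and the toy vectors are assumptions for the demo, not the paper's actual setup.

```python
import math
from collections import Counter

def token_entropy(tokens):
    """Shannon entropy (bits) of the empirical token distribution."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def delta_h(original, rewritten, tokenize=str.split):
    """ΔH: entropy of the rewritten text minus entropy of the original."""
    return token_entropy(tokenize(rewritten)) - token_entropy(tokenize(original))

def delta_s(u, v):
    """Δs: cosine similarity between original and rewritten embeddings."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

code = "def add(a, b): return a + b"
nl = "Return the sum of the two input numbers a and b."
print(f"ΔH = {delta_h(code, nl):+.3f}")               # positive: the NL rewrite raised entropy
print(f"Δs = {delta_s([0.6, 0.8], [0.8, 0.6]):.2f}")  # toy embeddings, not a real encoder
```

Since the paper reports that ΔH correlates with retrieval gain under QC (pooled Spearman ρ up to +0.593), a statistic this cheap can gate the per-query LLM call before any retrieval is run.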

Community

Andrea Gurioli (andreagurioli1995), first author:
Our paper explores how rewriting code into different forms can improve code retrieval systems.
We tested three rewriting strategies: rephrasing code, converting it into pseudocode, and translating it into full natural language.
We found that rewriting both the query and the code corpus together produces the best retrieval performance.
Our strongest results came from using full natural language rewriting, especially for smaller code encoders.
However, rewriting only the corpus often reduced performance because of mismatches between the query and the rewritten code.
We also introduced a diagnostic called ∆H, which helps predict when rewriting will improve retrieval results.
Overall, our paper demonstrates that LLM-based rewriting can significantly enhance code search when applied in the right conditions.
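
As a rough illustration of the joint query-corpus (QC) setup the comment describes, the sketch below rewrites both the query and every corpus snippet before matching. The `rewrite_to_nl` and `encode` functions are deliberately crude stand-ins (a text normalizer and a bag-of-words encoder) so the example runs end to end; the paper uses LLM rewriters (Qwen, DeepSeek, Mistral) and dense code encoders instead.

```python
import math
from collections import Counter

def rewrite_to_nl(text: str) -> str:
    # Stand-in for an LLM rewriter: strip punctuation and lowercase, so that
    # query and corpus land in the same normalized representation.
    return "".join(c if c.isalnum() or c.isspace() else " " for c in text).lower()

def encode(text: str) -> Counter:
    # Stand-in for a dense code encoder: sparse bag-of-words counts.
    return Counter(text.split())

def cosine(u: Counter, v: Counter) -> float:
    dot = sum(u[t] * v[t] for t in u)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def retrieve_qc(query: str, corpus: list[str], k: int = 10) -> list[int]:
    """Joint query-corpus (QC) augmentation: rewrite BOTH sides, then rank."""
    q = encode(rewrite_to_nl(query))
    scored = sorted(
        ((cosine(q, encode(rewrite_to_nl(doc))), i) for i, doc in enumerate(corpus)),
        reverse=True,
    )
    return [i for _, i in scored[:k]]

corpus = ["def add(a, b): return a + b", "def mul(a, b): return a * b"]
print(retrieve_qc("add two numbers a and b", corpus, k=1))  # -> [0]
```

Corpus-only (C) augmentation would skip the query rewrite; the paper finds this often hurts (56 of 90 configurations) precisely because the raw query and the rewritten corpus no longer share a representation style.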


Get this paper in your agent:

hf papers read 2605.08299
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper (0)

No model linking this paper

Cite arxiv.org/abs/2605.08299 in a model README.md to link it from this page.

Datasets citing this paper (0)

No dataset linking this paper

Cite arxiv.org/abs/2605.08299 in a dataset README.md to link it from this page.

Spaces citing this paper (0)

No Space linking this paper

Cite arxiv.org/abs/2605.08299 in a Space README.md to link it from this page.

Collections including this paper (0)

No Collection including this paper

Add this paper to a collection to link it from this page.

