Hugging Face Daily Papers · May 20, 2026 · 5 min read

Code-Guided Reasoning for Small Language Models: Evaluating Executable MCQA Scaffolds

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Like Read original ↗

30% direct-signal gate, the macro difference is +14.11 points. These estimates are descriptive. Assisted inference uses a larger solver-call budget, answer extraction is brittle, Time-MQA contains the observed regressions, and some generated programs violate the no-hard-coding instruction. CGR provides the trace package needed to interpret these results, including direct, assisted, and generator-side answers, partition definitions, generated programs, response metadata, and audits.","html":"<p>Multiple-choice QA benchmarks usually evaluate small language models (SLMs) as direct answerers, but deployed language-model systems increasingly rely on external scaffolds such as tools, code, and repeated model calls. We introduce Code-Guided Reasoning (CGR), an evaluation protocol and generated-program resource for measuring when executable reasoning scaffolds improve SLM performance on MCQA tasks. CGR standardizes six components: a normalized item interface, a direct solver prompt, a generator prompt, a Python scaffold, solver-call and extraction helpers, and a three-channel result record. On 20,498 retained result rows from a locally prepared MCQA bundle and six metadata-registered solver models, the observed non-zero-baseline partition shows 66.21% macro assisted accuracy versus 38.11% direct accuracy, a +28.10 percentage-point difference with a pair-bootstrap interval of [20.32, 36.43]. Under a stricter Ab > 30% direct-signal gate, the macro difference is +14.11 points. These estimates are descriptive. Assisted inference uses a larger solver-call budget, answer extraction is brittle, Time-MQA contains the observed regressions, and some generated programs violate the no-hard-coding instruction. CGR provides the trace package needed to interpret these results, including direct, assisted, and generator-side answers, partition definitions, generated programs, response metadata, and audits.</p>\n","updatedAt":"2026-05-20T04:13:30.712Z","author":{"_id":"64c47f731d44fc06afc80953","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/UT2mHX2WuCm5Ws4rGKyCB.png","fullname":"Dhaval Patel","name":"DhavalPatel","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":9,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.8505786657333374},"editors":["DhavalPatel"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/UT2mHX2WuCm5Ws4rGKyCB.png"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2605.18827","authors":[{"_id":"6a0d34cf65eb30f20d962d17","name":"Prateek Biswas","hidden":false},{"_id":"6a0d34cf65eb30f20d962d18","name":"Dhaval Patel","hidden":false},{"_id":"6a0d34cf65eb30f20d962d19","name":"Vedant Khandelwal","hidden":false},{"_id":"6a0d34cf65eb30f20d962d1a","name":"Shuxin Lin","hidden":false},{"_id":"6a0d34cf65eb30f20d962d1b","name":"Amit Sheth","hidden":false}],"publishedAt":"2026-05-12T00:00:00.000Z","submittedOnDailyAt":"2026-05-20T00:00:00.000Z","title":"Code-Guided Reasoning for Small Language Models: Evaluating Executable MCQA Scaffolds","submittedOnDailyBy":{"_id":"64c47f731d44fc06afc80953","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/UT2mHX2WuCm5Ws4rGKyCB.png","isPro":false,"fullname":"Dhaval Patel","user":"DhavalPatel","type":"user","name":"DhavalPatel"},"summary":"Multiple-choice QA benchmarks usually evaluate small language models (SLMs) as direct answerers, but deployed language-model systems increasingly rely on external scaffolds such as tools, code, and repeated model calls. We introduce Code-Guided Reasoning (CGR), an evaluation protocol and generated-program resource for measuring when executable reasoning scaffolds improve SLM performance on MCQA tasks. CGR standardizes six components: a normalized item interface, a direct solver prompt, a generator prompt, a Python scaffold, solver-call and extraction helpers, and a three-channel result record. On 20,498 retained result rows from a locally prepared MCQA bundle and six metadata-registered solver models, the observed non-zero-baseline partition shows 66.21% macro assisted accuracy versus 38.11% direct accuracy, a +28.10 percentage-point difference with a pair-bootstrap interval of [20.32, 36.43]. Under a stricter Ab > 30% direct-signal gate, the macro difference is +14.11 points. These estimates are descriptive. Assisted inference uses a larger solver-call budget, answer extraction is brittle, Time-MQA contains the observed regressions, and some generated programs violate the no-hard-coding instruction. CGR provides the trace package needed to interpret these results, including direct, assisted, and generator-side answers, partition definitions, generated programs, response metadata, and audits.","upvotes":1,"discussionId":"6a0d34cf65eb30f20d962d1c","ai_summary":"Code-Guided Reasoning (CGR) evaluates how executable reasoning scaffolds enhance small language model performance on multiple-choice question answering tasks through standardized components and measured improvements.","ai_keywords":["multiple-choice QA","small language models","executable reasoning","code-guided reasoning","direct solver prompt","generator prompt","Python scaffold","solver-call helpers","answer extraction","macro accuracy","bootstrap interval","direct-signal gate","generated programs","trace package","response metadata"],"organization":{"_id":"616e7b1d75754a5d5fa455cf","name":"ibm","fullname":"IBM","avatar":"https://cdn-avatars.huggingface.co/v1/production/uploads/637bfdf60dc13843b468ac20/9228luWRoGbZwKGxkOOsj.png"}},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"661ab1f1fa3b144a381fa454","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/661ab1f1fa3b144a381fa454/IlpZBb9NCjo7ntFwMIH53.png","isPro":true,"fullname":"Urro","user":"urroxyz","type":"user"}],"acceptLanguages":["en"],"dailyPaperRank":0,"organization":{"_id":"616e7b1d75754a5d5fa455cf","name":"ibm","fullname":"IBM","avatar":"https://cdn-avatars.huggingface.co/v1/production/uploads/637bfdf60dc13843b468ac20/9228luWRoGbZwKGxkOOsj.png"},"markdownContentUrl":"https://huggingface.co/buckets/huggingchat/papers-content/resolve/2605/2605.18827.md"}">

Papers

arxiv:2605.18827

Code-Guided Reasoning for Small Language Models: Evaluating Executable MCQA Scaffolds

Published on May 12

· Submitted by

Dhaval Patel on May 20

IBM

Upvote

Authors:

Abstract

Code-Guided Reasoning (CGR) evaluates how executable reasoning scaffolds enhance small language model performance on multiple-choice question answering tasks through standardized components and measured improvements.

AI-generated summary

Multiple-choice QA benchmarks usually evaluate small language models (SLMs) as direct answerers, but deployed language-model systems increasingly rely on external scaffolds such as tools, code, and repeated model calls. We introduce Code-Guided Reasoning (CGR), an evaluation protocol and generated-program resource for measuring when executable reasoning scaffolds improve SLM performance on MCQA tasks. CGR standardizes six components: a normalized item interface, a direct solver prompt, a generator prompt, a Python scaffold, solver-call and extraction helpers, and a three-channel result record. On 20,498 retained result rows from a locally prepared MCQA bundle and six metadata-registered solver models, the observed non-zero-baseline partition shows 66.21% macro assisted accuracy versus 38.11% direct accuracy, a +28.10 percentage-point difference with a pair-bootstrap interval of [20.32, 36.43]. Under a stricter Ab > 30% direct-signal gate, the macro difference is +14.11 points. These estimates are descriptive. Assisted inference uses a larger solver-call budget, answer extraction is brittle, Time-MQA contains the observed regressions, and some generated programs violate the no-hard-coding instruction. CGR provides the trace package needed to interpret these results, including direct, assisted, and generator-side answers, partition definitions, generated programs, response metadata, and audits.

View arXiv page View PDF Add to collection

Community

DhavalPatel

Paper submitter about 9 hours ago

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Get this paper in your agent:

hf papers read 2605.18827

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2605.18827 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2605.18827 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2605.18827 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.

Discussion (0)

No comments yet. Sign in and be the first to say something.

Code-Guided Reasoning for Small Language Models: Evaluating Executable MCQA Scaffolds

Abstract

Community

Models citing this paper 0

Datasets citing this paper 0

Spaces citing this paper 0

Collections including this paper 0

Discussion (0)

More from Hugging Face Daily Papers