Hugging Face Daily Papers · June 19, 2026 · 6 min read

Multi-LCB: Extending LiveCodeBench to Multiple Programming Languages

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Authors: Maria Ivanova, Pavel Zadorozhny, Rodion Levichev, Ivan Petrov, Adamenko Pavel, Ivan Lopatin, Alexey Kutalev, Dmitrii Babaev

Project page: https://multi-lcb.github.io/ · GitHub: https://github.com/Multi-LCB/Multi-LCB

arxiv:2606.20517

Published on Jun 18 · Submitted by Dmitri Babaev on Jun 19

Abstract

Multi-LCB addresses the limitation of LiveCodeBench by providing a multi-language benchmark for evaluating LLMs across twelve programming languages while maintaining contamination controls and evaluation protocols.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

LiveCodeBench (LCB) has recently become a widely adopted benchmark for evaluating large language models (LLMs) on code-generation tasks. By curating competitive programming problems, continually adding fresh problems to the set, and filtering them by release date, LCB provides contamination-aware evaluation and offers a holistic view of coding capability. However, LCB remains restricted to Python, leaving open the question of whether LLMs can generalize across the diverse programming languages required in real-world software engineering.

We introduce Multi-LCB, a benchmark for evaluating LLMs across twelve programming languages, including Python. Multi-LCB transforms Python tasks from the LCB dataset into equivalent tasks in other languages while preserving LCB's contamination controls and evaluation protocol. Because it is fully compatible with the original LCB format, Multi-LCB will automatically track future LCB updates, enabling systematic assessment of cross-language code generation competence and requiring models to sustain performance well beyond Python.

We evaluated 24 instruction and reasoning LLMs on Multi-LCB, uncovering evidence of Python overfitting, language-specific contamination, and substantial disparities in multilingual performance. Our results establish Multi-LCB as a rigorous new benchmark for multi-programming-language code evaluation, directly addressing LCB's primary limitation and exposing critical gaps in current LLM capabilities.
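
The release-date filtering mentioned in the abstract can be illustrated with a small sketch. This is not LCB's or Multi-LCB's actual code: the `Problem` record, its field names, and the `released_after` helper are hypothetical, and the strict "after the cutoff" comparison is an assumption about how a contamination window might be applied.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Problem:
    # Hypothetical record; the real dataset schema may differ.
    problem_id: str
    release_date: date


def released_after(problems, cutoff):
    """Keep only problems published strictly after a model's training cutoff."""
    return [p for p in problems if p.release_date > cutoff]


if __name__ == "__main__":
    pool = [
        Problem("older-task", date(2024, 1, 10)),
        Problem("newer-task", date(2025, 3, 1)),
    ]
    # Assumed cutoff, for illustration only.
    for problem in released_after(pool, date(2024, 12, 31)):
        print(problem.problem_id)
```

Evaluating a model only on problems released after its training cutoff reduces the chance that the model saw those problems during training.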

Community

dllllb

Paper submitter about 12 hours ago

The Multi-LCB benchmark evaluates LLM code generation capabilities on identical algorithmic tasks across twelve programming languages, covering both single-turn and agentic scenarios.

noahml

about 10 hours ago

Neat paper. It feels like everyone has been benchmarking strictly on Python for too long, so seeing someone actually push for a multilingual standard that keeps up with fresh competitive programming problems is a welcome change.

I'm curious about the translation process they used to convert the Python tasks into twelve other languages. How do they ensure that the difficulty level and the logic requirements remain consistent across such a diverse set of languages?

I made a podcast on it with ResearchPod; it makes it easy to get the key concepts on the go:
https://researchpod.app/episode/48fc95dc-b07e-4f50-bd09-6170b23ca5cd

dllllb

about 5 hours ago

We translated the evaluation tasks into an input–output format, requiring solutions to read from standard input (stdin) and write to standard output (stdout). We also developed language-specific evaluation scripts, one for each supported programming language. This enables LLMs to solve the same task in any of the supported languages while being evaluated consistently across all of them.
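
A stdin/stdout check of the kind described above might look roughly like the following sketch. It is not taken from the Multi-LCB evaluation scripts: `run_case`, `passes_all`, the five-second timeout, and the whitespace-tolerant output comparison are all assumptions made for illustration. The only language-specific part is the command list used to launch the solution.

```python
import subprocess
import sys


def run_case(cmd, stdin_text, expected, timeout=5.0):
    """Run a solution command on one test case and compare its stdout.

    `cmd` is the language-specific launch command (an interpreter call or
    a compiled binary); everything else is language-agnostic.
    """
    try:
        proc = subprocess.run(
            cmd,
            input=stdin_text,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False  # treat a timeout as a failed case
    if proc.returncode != 0:
        return False  # runtime error or non-zero exit
    # Assumption: compare token by token, ignoring whitespace differences.
    return proc.stdout.split() == expected.split()


def passes_all(cmd, cases):
    """A solution passes only if every (stdin, expected stdout) pair matches."""
    return all(run_case(cmd, given, expected) for given, expected in cases)


if __name__ == "__main__":
    # Toy solution that adds two integers read from stdin.
    solution = "import sys; a, b = map(int, sys.stdin.read().split()); print(a + b)"
    cmd = [sys.executable, "-c", solution]
    print(passes_all(cmd, [("2 3\n", "5\n"), ("10 -4\n", "6\n")]))
```

Because the test cases are plain text pairs, the same cases can be reused for any language by swapping only `cmd`, which is one way to keep scoring consistent across languages.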
