Hugging Face Daily Papers · June 2, 2026 · 4 min read

K-BrowseComp: A Web Browsing Agent Benchmark Grounded in Korean Contexts

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Frontier model evaluations are shifting from foundational capabilities (e.g., instruction following and reasoning) toward compositional, agentic ones, but Korean agentic benchmarks remain scarce. We introduce K-BrowseComp, a web-browsing agent benchmark grounded in Korean contexts, consisting of 400 problems. The 300-problem K-BrowseComp-Verified subset is manually constructed and validated by native Korean speakers. On this subset, frontier LLMs, including GPT-5.5, DeepSeek-V4-Pro, and GLM-5.1, reach only 30.00–45.67%, a substantial drop from BrowseComp, while Korean LLMs released through Korea's Proprietary AI Foundation Model program obtain only 0.00–10.33%. We further construct a 100-problem synthetic split using hard few-shot exemplars and failure-mode-targeted generation to exploit the asymmetry between solving and creating web browsing problems. On the adversarially filtered synthetic diagnostic split, the strongest model reaches only 26.00%, and we report this split separately as a targeted stress test. We publicly release our data and code.

Code: https://github.com/prometheus-eval/K-BrowseComp

arxiv:2606.02404

Published on Jun 1 · Submitted by Seungone Kim on Jun 2 · Carnegie Mellon University

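Authors: Nahyun Lee, Dongkeun Yoon, Guijin Son, Geewook Kim, Dayoon Ko, Jeonghun Park, Haneul Yoo, Jaewon Cho, Junghun Park, Changyoon Lee, Kyochul Jang, Jaeyeon Kim, Eunsu Kim, Woojin Cho, Seungone Kim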

AI-generated summary

The Korean web-browsing agent benchmark K-BrowseComp evaluates frontier LLMs on 400 problems, showing significant performance gaps relative to English benchmarks and highlighting the need for more robust Korean AI development.