Hugging Face Daily Papers · 4 min read

K-BrowseComp: A Web Browsing Agent Benchmark Grounded in Korean Contexts

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2606.02404

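Published on Jun 1, 2026 · Submitted by Seungone Kim on Jun 2, 2026
Authors: Nahyun Lee, Dongkeun Yoon, Guijin Son, Geewook Kim, Dayoon Ko, Jeonghun Park, Haneul Yoo, Jaewon Cho, Junghun Park, Changyoon Lee, Kyochul Jang, Jaeyeon Kim, Eunsu Kim, Woojin Cho, Seungone Kim
Code: https://github.com/prometheus-eval/K-BrowseComp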

Abstract

K-BrowseComp, a Korean web-browsing agent benchmark of 400 problems, evaluates frontier LLMs and shows a substantial performance gap relative to English benchmarks, highlighting the need for more robust Korean AI development.

AI-generated summary

Frontier model evaluations are shifting from foundational capabilities (e.g., instruction following and reasoning) toward compositional, agentic ones, but Korean agentic benchmarks remain scarce. We introduce K-BrowseComp, a web-browsing agent benchmark grounded in Korean contexts, consisting of 400 problems. The 300-problem K-BrowseComp-Verified subset is manually constructed and validated by native Korean speakers. On this subset, frontier LLMs, including GPT-5.5, DeepSeek-V4-Pro, and GLM-5.1, reach only 30.00–45.67%, a substantial drop from BrowseComp, while Korean LLMs released through Korea's Proprietary AI Foundation Model program obtain only 0.00–10.33%. We further construct a 100-problem synthetic split using hard few-shot exemplars and failure-mode-targeted generation to exploit the asymmetry between solving and creating web browsing problems. On the adversarially filtered synthetic diagnostic split, the strongest model reaches only 26.00%, and we report this split separately as a targeted stress test. We publicly release our data and code.
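
To make the reported percentages concrete, the sketch below converts them into implied counts of solved problems. Only the split sizes (300 verified, 100 synthetic) and the accuracy figures come from the abstract. The helper names are hypothetical, and the conversion assumes each accuracy is a single-run fraction of correct answers over the whole split, which the abstract does not state.

```python
# Split sizes are taken from the abstract; helper names are illustrative only.
SPLIT_SIZES = {"verified": 300, "synthetic": 100}


def implied_correct(accuracy_pct: float, n_problems: int) -> int:
    """Nearest whole number of solved problems for a reported accuracy.

    Assumes accuracy = correct / total on a single run (an assumption,
    not something the abstract confirms).
    """
    return round(accuracy_pct / 100 * n_problems)


def accuracy_pct(n_correct: int, n_problems: int) -> float:
    """Accuracy as a percentage, rounded to two decimals as in the abstract."""
    return round(100 * n_correct / n_problems, 2)


# (split, reported accuracy in percent) pairs quoted in the abstract.
reported = [
    ("verified", 30.00),   # lower end of the frontier-model range
    ("verified", 45.67),   # upper end of the frontier-model range
    ("verified", 0.00),    # lower end of the Korean-LLM range
    ("verified", 10.33),   # upper end of the Korean-LLM range
    ("synthetic", 26.00),  # strongest model on the synthetic split
]

for split, pct in reported:
    n = SPLIT_SIZES[split]
    k = implied_correct(pct, n)
    print(f"{split}: {pct:.2f}% of {n} is about {k} solved "
          f"(round trip: {accuracy_pct(k, n):.2f}%)")
```

Under that assumption, the frontier-model range on the verified subset corresponds to roughly 90 to 137 of 300 problems solved, and the best result on the synthetic split to 26 of 100.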


Get this paper in your agent:

hf papers read 2606.02404
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

