Hugging Face Daily Papers · 10 min read

Harness-1: Reinforcement Learning for Search Agents with State-Externalizing Harnesses

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2606.02373

Published on Jun 1 · Submitted by Patrick Jiang on Jun 2
Authors: Pengcheng Jiang, Zhiyi Shi, Kelly Hong, Xueqiang Xu, Jiashuo Sun, Jimeng Sun, Hammad Bashir, Jiawei Han

Abstract

A 20B search agent trained with reinforcement learning within a stateful search framework demonstrates superior retrieval performance across multiple domains by separating semantic decision-making from environmental bookkeeping.

Search agents are often trained as policies over growing transcripts: the model must decide how to search while also remembering what it has seen, which evidence is useful, which constraints remain open, and which claims have actually been checked. We argue that this formulation puts too much routine state management inside the policy: reinforcement learning is forced to optimize both semantic search decisions and recoverable bookkeeping that the environment can maintain more reliably. We introduce Harness-1, a 20B search agent (retrieval subagent) trained with reinforcement learning inside a stateful search harness. The harness maintains environment-side working memory, including a candidate pool, an importance-tagged curated set, compact evidence links, verification records, compressed and deduplicated observations, and budget-aware context rendering. The policy retains the semantic decisions: what to search, which documents to keep or discard, what to verify, and when to stop. Across eight retrieval benchmarks spanning web, finance, patents, and multi-hop QA, Harness-1 achieves 0.730 average curated recall, outperforming the next strongest open search subagent by +11.4 points and remaining competitive with much larger frontier-model searchers. Its gains are especially strong on held-out transfer benchmarks, suggesting that reinforcement learning over explicit search state can produce retrieval behaviors that generalize beyond the training domains. Our code is available at https://github.com/pat-jj/harness-1.

Community

🔥 Introducing Harness-1 🔥

Harness-1 is a 20B open search agent trained with state-externalizing harnesses, matching or outperforming several much larger frontier-model searchers on difficult retrieval tasks.

[Figure: Harness-1 performance]

Motivation

Many search agents are trained over growing transcripts. As a result, the model has to search while also doing a lot of implicit bookkeeping:

  • remembering candidate documents,
  • tracking useful evidence,
  • maintaining verification status,
  • recalling search history,
  • and avoiding repeat visits to what has already been seen.

This makes the model responsible not only for search decisions, but also for managing the entire search state inside its context.

Key idea

Harness-1 separates these responsibilities.

The policy still makes the semantic decisions:

  • what to search,
  • what to inspect,
  • what to curate,
  • what to verify,
  • and when to stop.

But the harness maintains the recoverable search state around those decisions, including candidate pools, curated evidence, evidence links, verification records, and budget-aware context rendering.
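To make the division of labor concrete, here is a minimal sketch of what environment-side state could look like. It is not the Harness-1 implementation: the class name `SearchWorkspace`, its fields, and its methods are all hypothetical, chosen only to mirror the components listed above (candidate pool, curated evidence, evidence links, verification records, search history).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SearchWorkspace:
    """Hypothetical harness-side state; every name here is illustrative."""
    candidates: Dict[str, str] = field(default_factory=dict)      # doc_id -> snippet
    curated: Dict[str, str] = field(default_factory=dict)         # doc_id -> importance tag
    evidence_links: Dict[str, List[str]] = field(default_factory=dict)  # claim -> doc_ids
    verified: Dict[str, bool] = field(default_factory=dict)       # claim -> check outcome
    queries: List[str] = field(default_factory=list)              # search history

    def add_results(self, query: str, results: Dict[str, str]) -> List[str]:
        """Record a search and return only the doc ids not seen before."""
        self.queries.append(query)
        new_ids = []
        for doc_id, snippet in results.items():
            if doc_id not in self.candidates:  # deduplicate against the pool
                self.candidates[doc_id] = snippet
                new_ids.append(doc_id)
        return new_ids

    def curate(self, doc_id: str, importance: str, claim: Optional[str] = None) -> None:
        """Move a known candidate into the curated set, optionally linking it to a claim."""
        if doc_id not in self.candidates:
            raise KeyError(f"{doc_id} is not in the candidate pool")
        self.curated[doc_id] = importance
        if claim is not None:
            self.evidence_links.setdefault(claim, []).append(doc_id)

    def record_verification(self, claim: str, supported: bool) -> None:
        """Store the outcome of a verification step for a claim."""
        self.verified[claim] = supported
```

In a design like this, the policy only chooses which method to call and with what arguments; remembering what has been seen is the workspace's job.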

With this setup, RL does not need to teach the model to manage an unstructured transcript from scratch. Instead, it trains the model to operate over a structured search workspace.
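One way to picture "operating over a structured workspace" is a renderer that turns stored state into the text the policy sees at each step, under a fixed budget. The sketch below is an assumption-laden illustration, not the paper's renderer: the function name `render_context`, the three importance levels, the line format, and the whitespace-token budget are all made up for this example.

```python
from typing import List, Tuple

# Hypothetical importance levels, most important first.
_RANK = {"high": 0, "medium": 1, "low": 2}


def render_context(
    curated: List[Tuple[str, str, str]],   # (doc_id, importance, snippet)
    candidates: List[Tuple[str, str]],     # (doc_id, snippet)
    budget: int,                           # max whitespace-separated tokens
) -> str:
    """Render curated items first (by importance), then candidates, within budget."""
    ordered = sorted(curated, key=lambda item: _RANK.get(item[1], len(_RANK)))
    entries = [f"[curated:{imp}] {doc_id}: {snip}" for doc_id, imp, snip in ordered]
    entries += [f"[candidate] {doc_id}: {snip}" for doc_id, snip in candidates]

    lines, used = [], 0
    for entry in entries:
        cost = len(entry.split())
        if used + cost > budget:
            continue  # skip entries that do not fit; a shorter later entry may still fit
        lines.append(entry)
        used += cost
    return "\n".join(lines)
```

Because the view is rebuilt from state on every step, the context length stays bounded no matter how long the episode runs, which is the contrast with an ever-growing transcript.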

Results

Across 8 difficult retrieval benchmarks, Harness-1 reaches 0.730 average curated recall, outperforming the next strongest open search subagent by +11.4 points, while remaining competitive with much larger frontier-model searchers.
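The post does not spell out how curated recall is computed. A plausible reading, offered here only as an assumption, is the fraction of gold relevant documents that end up in the agent's final curated set, averaged per benchmark and then across benchmarks. The function names and the toy ids below are hypothetical and are not results from the paper.

```python
from typing import Dict, Iterable, List, Tuple


def curated_recall(curated_ids: Iterable[str], gold_ids: Iterable[str]) -> float:
    """Fraction of gold documents present in the curated set (assumed definition)."""
    gold = set(gold_ids)
    if not gold:
        raise ValueError("gold set must be non-empty")
    return len(gold & set(curated_ids)) / len(gold)


def average_recall(per_benchmark: Dict[str, List[Tuple[List[str], List[str]]]]) -> float:
    """Macro-average: mean recall per benchmark, then mean across benchmarks."""
    benchmark_scores = []
    for queries in per_benchmark.values():
        scores = [curated_recall(curated, gold) for curated, gold in queries]
        benchmark_scores.append(sum(scores) / len(scores))
    return sum(benchmark_scores) / len(benchmark_scores)


# Toy data, invented for illustration only.
toy = {
    "web": [(["a"], ["a", "b"]), (["c"], ["c"])],
    "patents": [([], ["z"])],
}
print(average_recall(toy))
```

Under this reading, curating extra irrelevant documents does not lower the score, so any precision or set-size constraint would have to come from elsewhere in the evaluation setup.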

The most interesting result to us is transfer: the gains are substantially larger on held-out transfer benchmarks than on source-family benchmarks. Ablations also show that removing the harness mechanisms changes agent behavior and hurts recall.

Takeaway

For search agents, the model is not the whole learning system.

The harness — memory layout, action space, curation interface, verification records, and context rendering — is part of what RL learns to use.
