RewardHarness: Self-Evolving Agentic Post-Training
Authors: Yuxuan Zhang, Penghui Du, Bo Li, Cong Wei, Junwen Miao, Huaisong Zhang, Songcheng Cai, Yubo Wang, Dongfu Jiang, Yuyu Zhang, Ping Nie, Wenhu Chen, Changqian Yu, Kelsey R. Allen
Abstract
AI-generated summary
RewardHarness is a self-evolving framework that improves image edit evaluation by iteratively developing tools and skills from limited human demonstrations, achieving superior performance compared to existing models.
Evaluating instruction-guided image edits requires rewards that reflect subtle human preferences, yet current reward models typically depend on large-scale preference annotation and additional model training. This creates a data-efficiency gap: humans can often infer the target evaluation criteria from only a few examples, while models are usually trained on hundreds of thousands of comparisons. We present RewardHarness, a self-evolving agentic reward framework that reframes reward modeling as context evolution rather than weight optimization. Instead of learning from large-scale annotations, RewardHarness aligns with human preferences by iteratively evolving a library of tools and skills from as few as 100 preference demonstrations. Given a source image, candidate edited images, and an editing instruction, an Orchestrator selects the most relevant subset of tools and skills from the maintained library, and a frozen Sub-Agent uses them to construct a reasoning chain that produces a preference judgment. By comparing predicted judgments with ground-truth preferences and analyzing successes and failures in the reasoning process, the Orchestrator automatically refines its library of tools and skills without additional human annotation. Using only 0.05% of the EditReward preference data, RewardHarness achieves 47.4% average accuracy on image-editing evaluation benchmarks, surpassing GPT-5 by 5.3 points. When used as a reward signal for GRPO fine-tuning, RL-tuned models achieve 3.52 on ImgEdit-Bench. Project page: https://rewardharness.com.
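Code: https://github.com/TIGER-AI-Lab/RewardHarness

The sketch below illustrates the self-evolving loop described in the abstract: an Orchestrator selects tools from an evolving library, a frozen Sub-Agent uses them to produce a preference judgment, and the library is refined by comparing judgments against ground-truth preferences. All class and function names (Tool, Demo, Orchestrator, sub_agent_judge, evolve) are illustrative placeholders rather than the authors' released API, and the Sub-Agent call is stubbed.

```python
# Minimal sketch of the self-evolving reward loop, assuming a stubbed
# Sub-Agent; names here are illustrative, not the authors' released API.

import random
from dataclasses import dataclass


@dataclass
class Tool:
    """A tool or skill in the Orchestrator's evolving library."""
    name: str
    description: str
    success: int = 0   # times the tool appeared in a correct judgment
    failure: int = 0   # times the tool appeared in an incorrect judgment


@dataclass
class Demo:
    """One preference demonstration (~100 suffice per the abstract)."""
    source_image: str      # identifier/path of the source image
    candidates: list       # candidate edited images
    instruction: str       # editing instruction
    preferred: int         # index of the human-preferred candidate


def sub_agent_judge(demo: Demo, tools: list) -> int:
    """Stub for the frozen Sub-Agent: in the real framework it would run
    the selected tools, build a reasoning chain, and return a judgment."""
    return random.randrange(len(demo.candidates))


class Orchestrator:
    def __init__(self, library: list):
        self.library = library  # evolving library of tools and skills

    def select_tools(self, demo: Demo, k: int = 3) -> list:
        # Choose the k tools with the best empirical success rate so far.
        def score(t: Tool) -> float:
            total = t.success + t.failure
            return t.success / total if total else 0.5
        return sorted(self.library, key=score, reverse=True)[:k]

    def refine(self, tools: list, correct: bool) -> None:
        # Toy credit assignment: credit or blame the tools that were used.
        for t in tools:
            if correct:
                t.success += 1
            else:
                t.failure += 1


def evolve(orchestrator: Orchestrator, demos: list, epochs: int = 5) -> list:
    """Iterate over the demonstrations, evolving the library without any
    additional human annotation beyond the ground-truth preferences."""
    for _ in range(epochs):
        for demo in demos:
            tools = orchestrator.select_tools(demo)
            prediction = sub_agent_judge(demo, tools)
            orchestrator.refine(tools, correct=(prediction == demo.preferred))
    return orchestrator.library
```

In the paper's framework the refinement step analyzes the full reasoning chain (for example by rewriting tool descriptions or adding new skills); the success/failure tally above is only a placeholder for that behavior.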