Hugging Face Daily Papers · June 9, 2026 · 4 min read

Reasoning Arena: Trace Tournaments When Verifiable Rewards Fall Short

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.


arxiv:2606.09380

Published on Jun 8, 2026 · Submitted by Han Zhou (hzhouml) on Jun 9, 2026 · Mistral AI_

Authors: Han Zhou, Adam X. Yang, Laurence Aitchison, Anna Korhonen, Albert Q. Jiang

Abstract

Reasoning Arena improves reinforcement learning with verifiable rewards by using trace tournaments and Bradley-Terry models to generate meaningful gradients from non-diverse reward groups, resulting in faster training and better reasoning performance.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Reinforcement learning with verifiable rewards (RLVR) has become a leading paradigm for improving the reasoning ability of large language models through outcome-based supervision. However, verifiable rewards frequently become uninformative at the group level: when all sampled traces of a given prompt receive identical rewards, group-relative advantage estimation provides no gradient signal, even though the traces may differ substantially in reasoning quality. We propose Reasoning Arena, an adaptive training framework that routes such non-diverse reward groups to a judge system instead of discarding them. Beyond examining the final answer, Reasoning Arena constructs trace tournaments, where reasoning traces are compared head-to-head to expose finer-grained preferences within the group, converting reasoning quality into rich relative reward signals. To make reward estimation efficient, rather than exhaustively comparing every pair, each new trace is evaluated against a small, dynamically updated pool of previously generated traces as anchors to efficiently establish a relative ranking. We then fit a Bradley-Terry model on the incomplete comparison graph, enabling scalable RL integration without quadratic pairwise comparisons. Empirical results demonstrate that Reasoning Arena consistently outperforms the RLVR baseline by 7.6% on average in competition mathematics and coding benchmarks. By converting otherwise wasted zero-advantage samples into useful gradient updates, our method accelerates training by 27% to 41%, saving nearly 50% of generation compute, and substantially improves overall reasoning performance.
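The mechanism in the abstract can be illustrated with a small sketch. It first shows why group-relative advantages vanish when every trace in a group gets the same verifier reward, then fits Bradley-Terry scores from a sparse set of head-to-head outcomes and turns them into non-zero advantages. Everything specific here is an assumption for illustration and is not taken from the paper: the function names, the hard-coded judge outcomes, the anchor choices, and the fitting procedure (standard minorization-maximization updates with a small virtual-opponent prior to keep undefeated traces finite).

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: standardize rewards within one prompt's group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def is_non_diverse(rewards):
    """True when every trace in the group received the same reward."""
    return len(set(rewards)) <= 1


def fit_bradley_terry(n_items, comparisons, iters=200, prior=0.5):
    """Fit Bradley-Terry log-strengths from (winner, loser) index pairs.

    Uses MM updates. Each item also plays a hypothetical virtual opponent of
    strength 1, with `prior` pseudo-wins and `prior` pseudo-losses, so that
    scores stay finite on a sparse comparison graph (an assumption here).
    """
    wins = [0.0] * n_items
    games = {}  # unordered pair -> number of comparisons
    for winner, loser in comparisons:
        wins[winner] += 1.0
        key = (min(winner, loser), max(winner, loser))
        games[key] = games.get(key, 0) + 1

    p = [1.0] * n_items
    for _ in range(iters):
        # Virtual-opponent term first, then one term per observed pair.
        denom = [2.0 * prior / (p[i] + 1.0) for i in range(n_items)]
        for (i, j), n in games.items():
            share = n / (p[i] + p[j])
            denom[i] += share
            denom[j] += share
        p = [(wins[i] + prior) / denom[i] for i in range(n_items)]

    logs = [math.log(x) for x in p]
    center = mean(logs)
    return [x - center for x in logs]  # centered log-strengths


# Four traces for one prompt, all judged correct by the verifier.
rewards = [1.0, 1.0, 1.0, 1.0]
advantages = group_relative_advantages(rewards)  # all zero: no gradient signal

# Hypothetical judge outcomes as (winner, loser). Each new trace is compared
# only against earlier traces acting as anchors; trace 3 never meets trace 1.
comparisons = [(0, 1), (0, 2), (2, 1), (3, 0), (3, 2)]

if is_non_diverse(rewards):
    scores = fit_bradley_terry(len(rewards), comparisons)
    dense_advantages = group_relative_advantages(scores)
    ranking = sorted(range(len(scores)), key=lambda i: -scores[i])
    print("verifier advantages:", advantages)
    print("tournament ranking (best first):", ranking)
    print("tournament advantages:", [round(a, 3) for a in dense_advantages])
```

In this toy group, five comparisons stand in for the six that an exhaustive round robin would need; the saving grows with group size because each trace meets only a fixed number of anchors. How the anchor pool is chosen and updated, and how judge scores are combined with verifier rewards, are details the abstract does not specify.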


Community

hzhouml (paper submitter)

Trace tournaments can provide denser reward signals when verifier rewards are identical.


Get this paper in your agent:

hf papers read 2606.09380

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash

