Hugging Face Daily Papers · 3 min read

Conditional Equivalence of DPO and RLHF: Implicit Assumption, Failure Modes, and Provable Alignment

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2605.20834

Published on May 20, 2026 · Submitted by zhiqin yang on May 21, 2026
Authors: Zhiqin Yang, Yonggang Zhang, Wei Xue, Dong Fang, Bo Han, Yike Guo

AI-generated summary

Direct Preference Optimization (DPO) is theoretically equivalent to Reinforcement Learning from Human Feedback (RLHF) only under a specific implicit assumption; when that assumption fails, the two optimize different objectives. Constrained Preference Optimization (CPO) is proposed as a remedy with provable alignment properties.

Abstract

Direct Preference Optimization (DPO) has emerged as a popular alternative to Reinforcement Learning from Human Feedback (RLHF), offering theoretical equivalence with a simpler implementation. We prove that this equivalence is conditional rather than universal: it depends on an implicit assumption that is frequently violated in practice, namely that the RLHF-optimal policy must prefer human-preferred responses. When this assumption fails, DPO optimizes relative advantage over the reference policy rather than absolute alignment with human preferences, leading to pathological convergence in which policies decrease the DPO loss while still preferring dispreferred responses. We characterize when this assumption is violated, show the existence of an undesirable solution space, and prove that DPO and RLHF optimize fundamentally different objectives in such cases. To address this, we introduce Constrained Preference Optimization (CPO), which augments RLHF with constraints for provable alignment. We further provide a geometric interpretation through soft margin ranking, revealing that DPO implements margin ranking with potentially negative targets. Our theoretical analysis establishes when DPO's guarantees hold and provides solutions that preserve simplicity while offering provable alignment. Comprehensive experiments on standard benchmarks demonstrate that CPO achieves state-of-the-art performance. Code is available at: https://github.com/visitworld123/CPO.
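
The failure mode described in the abstract can be illustrated with a small numeric sketch. It uses the standard per-pair DPO loss, -log σ(β[(log π(y_w|x) - log π_ref(y_w|x)) - (log π(y_l|x) - log π_ref(y_l|x))]), and does not reproduce the paper's CPO method or its proofs. The log-probabilities, the value of β, and the function name `dpo_loss` are hypothetical choices made only for illustration.

```python
import math


def dpo_loss(pol_w, pol_l, ref_w, ref_l, beta=0.1):
    """Standard DPO loss for one preference pair.

    All inputs are sequence log-probabilities: pol_* under the trained
    policy, ref_* under the reference policy; *_w is the human-preferred
    response, *_l the dispreferred one.
    """
    margin = beta * ((pol_w - ref_w) - (pol_l - ref_l))
    # -log(sigmoid(margin)) written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))


# Hypothetical toy numbers: the reference strongly prefers the
# dispreferred response y_l over the preferred response y_w.
ref_w, ref_l = -12.0, -4.0

# A policy that narrows the gap but still ranks y_l above y_w.
pol_w, pol_l = -9.0, -5.0

loss_at_ref = dpo_loss(ref_w, ref_l, ref_w, ref_l)  # policy == reference
loss_at_pol = dpo_loss(pol_w, pol_l, ref_w, ref_l)

# What DPO rewards: improvement relative to the reference.
relative_gain = (pol_w - ref_w) - (pol_l - ref_l)
# What alignment needs: the policy itself ranking y_w above y_l.
absolute_gap = pol_w - pol_l
# Pushing relative_gain above 0 is the same as pushing absolute_gap
# above the reference gap, which is negative here.
implied_target = ref_w - ref_l

print(f"loss at reference: {loss_at_ref:.4f}")
print(f"loss at policy:    {loss_at_pol:.4f}")
print(f"relative gain:     {relative_gain:+.1f}")
print(f"absolute gap:      {absolute_gap:+.1f}")
print(f"implied target:    {implied_target:+.1f}")
```

In this toy case the loss drops below its value at the reference even though the policy still assigns higher probability to the dispreferred response. One way to read the abstract's "potentially negative targets" is the `implied_target` line: the loss only asks the policy's gap to exceed the reference's gap, which can itself be negative.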

Get this paper in your agent:

hf papers read 2605.20834
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
