When Does Trajectory-Level Supervision Permit Efficient Offline Reinforcement Learning?
Authors: Xuanfei Ren, Tengyang Xie
Organization: University of Wisconsin - Madison
Published: 2026-06-16
arXiv: 2606.18531
Abstract
Offline reinforcement learning with trajectory-level outcome supervision presents statistical challenges that can be addressed through pessimistic actor-critic methods, though fundamental barriers exist for certain generalized outcome-based problems.
Offline reinforcement learning is typically analyzed under process-level reward supervision, yet many sequential decision datasets
record only trajectory-level outcomes. We develop a statistical theory for offline policy optimization from such outcome-level
supervision. We first study the canonical setting where the target remains the expected cumulative reward, but each offline trajectory
provides only a scalar label whose conditional mean is the cumulative return. We propose OPAC, a pessimistic actor-critic algorithm
that learns a latent reward model and optimizes a policy from trajectory-level labels. We prove a high-probability guarantee of order
$\widetilde{O}\big(\sqrt{H^2 C_{sa}(\pi^\star)/n}\big)$ and a matching lower bound, characterizing the sharp statistical cost of replacing
process-level rewards with one trajectory-level label. We then extend the principle to preference-based feedback, preserving the
leading horizon and concentrability dependence up to preference-model constants. Finally, we study generalized outcome-based offline
RL, where both the supervision and the objective are trajectory-level quantities induced by a nonlinear aggregation of latent per-step
rewards. This problem is not learnable in general: for all-success objectives, any offline learner may require Ω(2^H)
trajectories even with deterministic transitions and constant concentrability. We then identify a tractable regime through two
structural coefficients, κ_μ(σ) and χ_μ(σ), capturing information loss in outcome aggregation and
generalized Bellman updates, under which generalized OPAC achieves polynomial sample complexity. Together, our results delineate when
outcome-level supervision enables sample-efficient offline control and when missing process-level rewards create fundamental
statistical barriers.
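
To make the canonical setting concrete, here is a minimal sketch of recovering a latent per-step reward from trajectory-level labels. It is not OPAC: it omits pessimism, the actor-critic loop, and any policy optimization, and only shows the regression step under simplifying assumptions that are not from the paper (a tabular reward, uniformly random state-action pairs in the offline data, Gaussian label noise, ridge-regularized least squares). The helper names `visitation_counts` and `fit_latent_reward` are hypothetical. The key point is that a label whose conditional mean is the cumulative return is linear in the trajectory's state-action visitation counts, so the per-step reward can be estimated by regression wherever the data covers those pairs.

```python
import numpy as np


def visitation_counts(traj, n_states, n_actions):
    """Count how often each (state, action) pair occurs in one trajectory."""
    phi = np.zeros(n_states * n_actions)
    for s, a in traj:
        phi[s * n_actions + a] += 1.0
    return phi


def fit_latent_reward(trajs, labels, n_states, n_actions, ridge=1e-3):
    """Ridge least squares: labels ~ visitation_counts(traj) @ reward.

    Illustrative stand-in for a latent reward model, not the paper's estimator.
    """
    Phi = np.stack([visitation_counts(t, n_states, n_actions) for t in trajs])
    gram = Phi.T @ Phi + ridge * np.eye(Phi.shape[1])
    r_flat = np.linalg.solve(gram, Phi.T @ np.asarray(labels, dtype=float))
    return r_flat.reshape(n_states, n_actions)


# Synthetic offline dataset (all sizes and distributions are assumptions).
rng = np.random.default_rng(0)
n_states, n_actions, horizon, n_trajs = 3, 2, 4, 2000
true_r = rng.uniform(0.0, 1.0, size=(n_states, n_actions))

trajs, labels = [], []
for _ in range(n_trajs):
    states = rng.integers(0, n_states, size=horizon)
    actions = rng.integers(0, n_actions, size=horizon)
    traj = [(int(s), int(a)) for s, a in zip(states, actions)]
    # One scalar label per trajectory; its conditional mean is the return.
    ret = sum(true_r[s, a] for s, a in traj)
    trajs.append(traj)
    labels.append(ret + 0.1 * rng.normal())

r_hat = fit_latent_reward(trajs, labels, n_states, n_actions)
print("max abs reward error:", np.max(np.abs(r_hat - true_r)))
```

In this toy setup every state-action pair is well covered, so the regression is well conditioned. When the data covers the relevant pairs poorly, the estimate becomes unreliable in those directions, which is the kind of situation the concentrability term in the bound accounts for.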
Community
This paper studies offline reinforcement learning when supervision is available only at the trajectory level, rather than as per-step rewards. It asks when such outcome-level supervision is statistically sufficient for efficient policy optimization, and where the difficulty comes from: offline distribution shift and recovering latent per-step rewards from aggregated labels.
The paper proposes OPAC, a pessimistic actor-critic algorithm that learns a latent reward model from trajectory-level labels and optimizes policies under offline data coverage. It also gives matching upper and lower bounds, and extends the theory to preference feedback and more general trajectory-level objectives.