arXiv — Machine Learning · · 3 min read

Zero Collapse: A Failure Mode of Policy Gradient Methods in Discontinuous Reward Environments

Mirrored from arXiv — Machine Learning for archival readability. Support the source by reading on the original site.

Computer Science > Machine Learning

arXiv:2605.30896 (cs)
[Submitted on 29 May 2026]

Title:Zero Collapse: A Failure Mode of Policy Gradient Methods in Discontinuous Reward Environments

View a PDF of the paper titled Zero Collapse: A Failure Mode of Policy Gradient Methods in Discontinuous Reward Environments, by Nishant Kumar and 2 other authors
View PDF HTML (experimental)
Abstract:Bidding in repeated auctions is a central challenge for reinforcement learning (RL), combining continuous control with the strategic complexities of digital advertising. While policy gradient and value-based methods seem well-suited for these settings, they often struggle with the discontinuous, "cliff-like" nature of auction reward landscapes. In a first-price auction, for example, a bidder receives zero reward until they cross a specific threshold, after which the reward decreases as the bid increases. This creates a landscape of flat, zero-reward regions separated by sharp boundaries.
We identify a fundamental failure mode in this setting termed "zero collapse." We show that stochastic exploration and gradient-based updates can cause policies to overshoot optimal high-reward regions and enter flat, zero-reward regimes. Once there, the lack of an informative gradient signal makes recovery extremely sample-inefficient, effectively trapping the agent. We find that actor-critic methods are particularly susceptible, as biased value estimates can accelerate this movement toward unstable regions.
Our contributions include: (1) a mechanistic explanation of how discontinuous rewards lead to vanishing signals and zero collapse; (2) an analysis of the interaction between policy stochasticity and step size; and (3) an empirical demonstration of this phenomenon across REINFORCE and actor-critic variants. We propose practical mitigation strategies involving initialization and architectural choices to improve stability. Finally, we introduce a formal RL framework for auction environments highlighting their unique structural properties.
Comments: 20 pages, 7 figures; includes Appendix
Subjects: Machine Learning (cs.LG)
Cite as: arXiv:2605.30896 [cs.LG]
  (or arXiv:2605.30896v1 [cs.LG] for this version)
  https://doi.org/10.48550/arXiv.2605.30896
arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Nishant Kumar [view email]
[v1] Fri, 29 May 2026 06:29:26 UTC (475 KB)
Full-text links:

Access Paper:

    View a PDF of the paper titled Zero Collapse: A Failure Mode of Policy Gradient Methods in Discontinuous Reward Environments, by Nishant Kumar and 2 other authors
  • View PDF
  • HTML (experimental)
  • TeX Source

Current browse context:

cs.LG
< prev   |   next >
Change to browse by:
cs

References & Citations

Loading...

BibTeX formatted citation

loading...
Data provided by:

Bookmark

BibSonomy Reddit
Bibliographic Tools

Bibliographic and Citation Tools

Bibliographic Explorer Toggle
Bibliographic Explorer (What is the Explorer?)
Connected Papers Toggle
Connected Papers (What is Connected Papers?)
Litmaps Toggle
Litmaps (What is Litmaps?)
scite.ai Toggle
scite Smart Citations (What are Smart Citations?)
Code, Data, Media

Code, Data and Media Associated with this Article

alphaXiv Toggle
alphaXiv (What is alphaXiv?)
Links to Code Toggle
CatalyzeX Code Finder for Papers (What is CatalyzeX?)
DagsHub Toggle
DagsHub (What is DagsHub?)
GotitPub Toggle
Gotit.pub (What is GotitPub?)
Huggingface Toggle
Hugging Face (What is Huggingface?)
ScienceCast Toggle
ScienceCast (What is ScienceCast?)
Demos

Demos

Replicate Toggle
Replicate (What is Replicate?)
Spaces Toggle
Hugging Face Spaces (What is Spaces?)
Spaces Toggle
TXYZ.AI (What is TXYZ.AI?)
Related Papers

Recommenders and Search Tools

Link to Influence Flower
Influence Flower (What are Influence Flowers?)
Core recommender toggle
CORE Recommender (What is CORE?)
IArxiv recommender toggle
IArxiv Recommender (What is IArxiv?)
About arXivLabs

arXivLabs: experimental projects with community collaborators

arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website.

Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them.

Have an idea for a project that will add value for arXiv's community? Learn more about arXivLabs.

Discussion (0)

Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.

Sign in →

No comments yet. Sign in and be the first to say something.

More from arXiv — Machine Learning