Correcting Stochastic Update Bias in Preconditioned Language Model Optimizers
Mirrored from arXiv — Machine Learning for archival readability. Support the source by reading on the original site.
Abstract: Preconditioned optimizers are central to language model training, but their stochastic update rules are usually treated as direct approximations to population preconditioned descent. We show that this view misses two finite-sample biases. First, the gradient and preconditioner are typically estimated from the same minibatch, introducing gradient-preconditioner coupling bias. Second, even when the preconditioner estimate is unbiased, its inverse or inverse-root is generally biased because inversion is nonlinear. We propose a single-batch bias-correction framework that addresses both effects: cross-fitted preconditioning estimates the numerator and preconditioner from independent microbatch groups, while variance-corrected inversion uses microbatch variability to subtract the leading delta-method bias term. The framework applies to diagonal moment, diagonal curvature, and matrix preconditioning methods, instantiated in AdamW, Sophia, and Shampoo. Bias correction reduces held-out pretraining loss on Qwen2.5-0.5B by $0.15$, $0.07$, and $0.11$ nats, respectively; the effects on mixed-quality pretraining and downstream instruction tuning are consistently neutral-to-positive. Together, these results establish bias correction as a practical mechanism for reducing finite-sample update bias and improving the performance of preconditioned optimizers.
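The two corrections named in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the half-and-half microbatch split, and the use of a plain inverse of the diagonal second moment (rather than the inverse-root used by AdamW) are all assumptions made for illustration. The variance correction relies on the standard delta-method expansion E[1/x̂] ≈ 1/μ + Var(x̂)/μ³, so the leading bias term Var(x̂)/x̂³ is subtracted from the naive inverse.

```python
import numpy as np


def split_microbatches(grads):
    """Split per-microbatch gradients (shape [m, d]) into two disjoint halves.

    Hypothetical helper: the first half feeds the numerator, the second half
    feeds the preconditioner, so the two estimates share no samples.
    """
    half = grads.shape[0] // 2
    return grads[:half], grads[half:]


def corrected_inverse(samples, eps=1e-12):
    """Estimate 1 / E[x] per coordinate from samples of shape [k, d].

    The naive plug-in 1 / mean is biased upward because inversion is convex
    on positive values. The delta method gives
        E[1 / mean] ~= 1 / mu + Var(mean) / mu**3,
    so the leading bias term is estimated from the sample variance and
    subtracted. Assumes positive samples and k >= 2.
    """
    k = samples.shape[0]
    mean = samples.mean(axis=0) + eps
    var_of_mean = samples.var(axis=0, ddof=1) / k  # variance of the sample mean
    return 1.0 / mean - var_of_mean / mean**3


def cross_fitted_update(grads):
    """Toy diagonal preconditioned update direction from microbatch gradients.

    The numerator (mean gradient) and the preconditioner (inverse second
    moment) come from independent microbatch groups, which removes the
    coupling between the two estimates.
    """
    num_group, pre_group = split_microbatches(grads)
    g_hat = num_group.mean(axis=0)
    inv_v = corrected_inverse(pre_group**2)  # per-microbatch squared gradients
    return g_hat * inv_v


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    grads = rng.normal(loc=1.0, scale=0.5, size=(8, 4))  # 8 microbatches, 4 parameters
    print(cross_fitted_update(grads))
```

One caveat of this sketch: when the microbatch variance is large relative to the mean, the corrected inverse can become negative, so a practical version would likely need clipping or damping. How the paper handles that case is not stated in the abstract.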
| Comments: | 32 pages, 3 figures, 13 tables |
| Subjects: | Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Machine Learning (stat.ML) |
| MSC classes: | 68T07 |
| ACM classes: | I.2.0; I.2.6; I.2.7; G.3 |
| Cite as: | arXiv:2605.20756 [cs.LG] (or arXiv:2605.20756v1 [cs.LG] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2605.20756 (arXiv-issued DOI via DataCite, pending registration) |
Submission history
From: Nikhil Shivakumar Nayak
[v1] Wed, 20 May 2026 05:54:24 UTC (122 KB)