Benchmarking LLMs' Mathematical Reasoning with Unseen Random Variables Questions
Mirrored from arXiv — NLP / Computation & Language for archival readability. Support the source by reading on the original site.
Computer Science > Computation and Language
Title:Benchmarking LLMs' Mathematical Reasoning with Unseen Random Variables Questions
Abstract:Recent studies have raised significant concerns regarding the reliability of current mathematics benchmarks, highlighting issues such as simplistic design and potential data contamination. Consequently, developing a reliable benchmark that effectively evaluates large language models' (LLMs) genuine capabilities in mathematical reasoning remains a critical challenge. To address these concerns, we propose RV-Bench, a novel evaluation methodology for Benchmarking LLMs with Random Variables in mathematical reasoning. Specifically, we build question-generating functions to produce random variable questions (RVQs), whose background content mirrors original benchmark problems, but with randomized variable combinations, rendering them "unseen" to LLMs. Models must completely understand the inherent question pattern to correctly answer RVQs with diverse variable combinations. Thus, an LLM's genuine reasoning capability is reflected through its accuracy and robustness on RV-Bench. We conducted extensive experiments on over 30 representative LLMs across more than 1,000 RVQs. Our findings propose that LLMs exhibit a proficiency imbalance between encountered and ``unseen'' data distributions. Furthermore, RV-Bench reveals that proficiency generalization across similar mathematical reasoning tasks is limited, but we verified it can still be effectively elicited through test-time scaling.
| Comments: | Accepted to AAAI2026 |
| Subjects: | Computation and Language (cs.CL); Artificial Intelligence (cs.AI) |
| Cite as: | arXiv:2501.11790 [cs.CL] |
| (or arXiv:2501.11790v5 [cs.CL] for this version) | |
| https://doi.org/10.48550/arXiv.2501.11790
arXiv-issued DOI via DataCite
|
Submission history
From: Zijin Hong [view email][v1] Mon, 20 Jan 2025 23:41:22 UTC (288 KB)
[v2] Mon, 17 Feb 2025 08:06:13 UTC (705 KB)
[v3] Sat, 15 Mar 2025 09:20:49 UTC (713 KB)
[v4] Wed, 13 Aug 2025 13:29:49 UTC (711 KB)
[v5] Tue, 23 Jun 2026 12:36:04 UTC (578 KB)
Access Paper:
- View PDF
- HTML (experimental)
- TeX Source
References & Citations
Bibliographic and Citation Tools
Code, Data and Media Associated with this Article
Demos
Recommenders and Search Tools
arXivLabs: experimental projects with community collaborators
arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website.
Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them.
Have an idea for a project that will add value for arXiv's community? Learn more about arXivLabs.
More from arXiv — NLP / Computation & Language
-
Generating in the Limit with Infinitely Many Hallucinations
Jun 30
-
Extracting Knowledge from an Arabic-English Machine-Readable Dictionary Using Information Extraction
Jun 30
-
Developmental Trajectories of Situation Modeling and Mentalizing in Transformer Language Models
Jun 30
-
A French OSCE Dialogue Dataset and Controllable Virtual Patient System for Clinical Training
Jun 30
Discussion (0)
Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.
Sign in →No comments yet. Sign in and be the first to say something.