A machine-learning-assisted progressive digit-randomness screening framework for detecting non-random patterns in raw numerical research data
Mirrored from arXiv — Machine Learning for archival readability. Support the source by reading on the original site.
Computer Science > Machine Learning
Title:A machine-learning-assisted progressive digit-randomness screening framework for detecting non-random patterns in raw numerical research data
Abstract:Raw numerical datasets remain less systematically examined in integrity screening than images, plagiarism, or summary-statistic inconsistencies. We developed the Fabrication-risk Digit Randomness Screening model (FDRS), a statistical and machine-learning framework for detecting non-random digit-pattern irregularities in numerical research data. FDRS integrates single- and joint-decimal-digit tests, Cramer's V, entropy metrics, Kullback-Leibler divergence, digit-preference indices, progressive subsampling, and semi-supervised risk scoring. It was evaluated using an instrument-derived enzymatic absorbance dataset (RawData, n=253) and a blinded manually simulated irregular dataset (ErrData, n=255). RawData showed no significant deviation in single third-decimal-digit analysis, whereas ErrData showed a significant deviation. In joint third-fourth decimal digit analysis, ErrData showed higher Cramer's V, lower normalized entropy, higher KL divergence, and a more persistent progressive-subsampling deviation signal. In internal validation, Elastic-net Logistic Regression achieved the highest AUC (0.98395) and lowest Brier score (0.048439), while Random Forest achieved the highest accuracy (0.926667) and balanced accuracy (0.935). RawData received a low ensemble risk score of 0.124627 and was classified as Grade 0; ErrData received a score of 0.740760 and was classified as Grade 3. External real-world benchmarks supported graded risk stratification: three datasets without identified public post-publication concerns were classified as Grade 0 or 1, whereas two datasets from publicly questioned or institutionally handled articles were classified as Grade 2 or 3. FDRS can prioritize raw numerical datasets for further review by integrating interpretable statistical and machine-learning features. It is an auxiliary digit-structure screening tool, not standalone evidence of fabrication or misconduct.
| Subjects: | Machine Learning (cs.LG) |
| Cite as: | arXiv:2606.07128 [cs.LG] |
| (or arXiv:2606.07128v1 [cs.LG] for this version) | |
| https://doi.org/10.48550/arXiv.2606.07128
arXiv-issued DOI via DataCite (pending registration)
|
Access Paper:
- View PDF
References & Citations
Bibliographic and Citation Tools
Code, Data and Media Associated with this Article
Demos
Recommenders and Search Tools
arXivLabs: experimental projects with community collaborators
arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website.
Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them.
Have an idea for a project that will add value for arXiv's community? Learn more about arXivLabs.
More from arXiv — Machine Learning
-
Elmes*: Automated Construction of Fine-Grained Evaluation Rubrics for Large Language Models in Long-Tail Educational Scenarios
Jun 8
-
FAIR-Calib: Frontier-Aware Instability-Reweighted Calibration for Post-Training Quantization of Diffusion Large Language Models
Jun 8
-
Multi-Scale Feature Attention Network for Polymer Classification using THz Dual-Comb Spectroscopy
Jun 8
-
MacArena: Benchmarking Computer Use Agents on an Online macOS Environment
Jun 8
Discussion (0)
Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.
Sign in →No comments yet. Sign in and be the first to say something.