Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training
Mirrored from arXiv — NLP / Computation & Language for archival readability. Support the source by reading on the original site.
Abstract: Large Language Models (LLMs) are pre-trained on large amounts of data from different sources and domains. Such datasets often contain trillions of tokens, including large portions of copyrighted or proprietary content, which raises questions about the legal use of such models. This underscores the need for truly open pre-training data that complies with data security regulations. In this paper, we introduce Common Corpus, the largest open dataset for LLM pre-training. The data assembled in Common Corpus are either uncopyrighted or under open licenses, totaling about two trillion tokens. The dataset covers a wide variety of languages, ranging from high-resource European languages to some low-resource languages rarely represented in pre-training datasets. In addition, it includes a large amount of code data. The diversity of data sources, in terms of covered domains and time periods, opens up paths for both research and entrepreneurial needs across diverse areas of knowledge. We present the detailed provenance of the assembled data and the details of dataset filtering and curation. We train two small language models on Common Corpus and find that they perform comparably to other models of their size, indicating that our dataset is suitable for multilingual pre-training. Common Corpus represents a key contribution to the ecosystem for open science research on Large Language Models.
Subjects: Computation and Language (cs.CL)
Cite as: arXiv:2506.01732 [cs.CL] (or arXiv:2506.01732v3 [cs.CL] for this version)
DOI: https://doi.org/10.48550/arXiv.2506.01732 (arXiv-issued DOI via DataCite)
Submission history
From: Pavel Chizhov
[v1] Mon, 2 Jun 2025 14:43:15 UTC (695 KB)
[v2] Mon, 20 Apr 2026 13:44:50 UTC (828 KB)
[v3] Fri, 15 May 2026 14:33:53 UTC (828 KB)