A cross-domain tropical species dataset with Chinese vernacular names and CITES source links
Abstract: We describe a versioned cross-domain dataset of 410,499 active tropical species (working snapshot 2026-04-20) spanning three applied subdomains -- tropical_plants, tropical_aquatic, and tropical_pets -- that share a commercial and regulatory life cycle but are distributed across kingdom-organised biodiversity infrastructures. The resource joins taxonomic identifiers from GBIF, Plants of the World Online, iNaturalist, NCBI Taxonomy, the Catalogue of Life and the Encyclopedia of Life, and adds three original layers: a cross-domain ontology that re-segments taxa along trade and husbandry contexts; a Chinese vernacular layer with explicit per-name provenance under a typology that excludes unverified machine-generated proposals; and a CITES source-linkage layer connecting each taxon to its Species+ entry. Chinese vernacular coverage -- the proportion of taxa carrying a CJK Chinese name distinct from the scientific binomial -- reaches 99.50 percent (408,456 of 410,499; full-population count). Coverage characterises completeness, not name-translation accuracy; the latter is bounded by the four-level provenance typology and is the subject of a preliminary internal review reported here, with a blind external audit identified as the principal open item. Upstream content is referenced by stable identifier only for the original-contribution layers, supporting CC-BY 4.0 reuse. The dataset is deposited on Zenodo (https://doi.org/10.5281/zenodo.20377811). This preprint is the canonical v1.0 description of the dataset's current state; future Data Descriptor submission is anticipated but is contingent on the validation and release-engineering items listed in the Limitations.
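The coverage figure quoted in the abstract can be reproduced from its two counts, and the stated definition (a CJK Chinese name distinct from the scientific binomial) can be sketched as a per-record check. The following is a minimal illustration, not the authors' pipeline: the function names, the sample records, and the choice of Unicode ranges used to detect CJK characters are all assumptions made here for the example.

```python
import re

# Assumption: "CJK" is approximated by the CJK Unified Ideographs block
# plus Extension A. The dataset's actual character test is not specified.
CJK_RE = re.compile(r"[\u3400-\u4dbf\u4e00-\u9fff]")


def has_chinese_name(scientific_name, vernacular_zh):
    """Return True if the vernacular contains a CJK character and
    differs from the scientific binomial (hypothetical helper)."""
    if not vernacular_zh:
        return False
    name = vernacular_zh.strip()
    return bool(CJK_RE.search(name)) and name != scientific_name.strip()


def coverage_percent(covered, total):
    """Share of covered taxa, expressed as a percentage."""
    return 100.0 * covered / total


# Illustrative records only; these rows are not taken from the dataset.
sample = [
    ("Monstera deliciosa", "龟背竹"),
    ("Betta splendens", "暹罗斗鱼"),
    ("Examplea hypothetica", None),                    # no vernacular at all
    ("Examplea placeholder", "Examplea placeholder"),  # binomial echoed back
]
covered = sum(has_chinese_name(sci, zh) for sci, zh in sample)
print(f"sample coverage: {coverage_percent(covered, len(sample)):.2f} percent")

# Full-population counts as reported in the abstract.
print(f"reported coverage: {coverage_percent(408_456, 410_499):.2f} percent")
```

As the abstract notes, a check of this kind measures completeness only; it says nothing about whether a given Chinese name is the correct one for the taxon.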
| Comments: | 25 pages, 4 figures, 4 tables. Dataset descriptor for the Tropical Species Encyclopedia. Companion to the methodology paper arXiv:2606.00994. Dataset deposited at Zenodo (doi:https://doi.org/10.5281/zenodo.20377811); canonical preprint-of-record at Zenodo (doi:https://doi.org/10.5281/zenodo.20424981) |
| Subjects: | Computation and Language (cs.CL) |
| Cite as: | arXiv:2606.03156 [cs.CL] |
| | (or arXiv:2606.03156v1 [cs.CL] for this version) |
| | https://doi.org/10.48550/arXiv.2606.03156 (arXiv-issued DOI via DataCite, pending registration) |