Hugging Face Daily Papers · 6 min read

Who Annotates in NLP? A Large-scale Assessment of Human Annotation Reporting between 2018 and 2025

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2606.02255

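Published on Jun 1, 2026 · Submitted by Gagan Bhatia on Jun 2, 2026
Authors: Maria Kunilovskaya, Gagan Bhatia, Lisa Sophie Albertelli, Yanran Chen, Christian Greisinger, Lotta Kiefer, Christoph Leiter, Subhadeep Roy, Tewodros Achamaleh, Muhammad Arslan Manzoor, Sebastian Pohl, Yufang Hou, Steffen Eger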

Abstract

Large-scale audit of human annotation reporting in NLP reveals inconsistent documentation of critical annotation details, with improvements over time but ongoing gaps in reproducibility and reliability.

Human annotation is the empirical foundation of much NLP research, from dataset construction to model evaluation, but papers often leave unclear who produced the annotations and how the annotation process was controlled. We provide the first large-scale, task-level audit of human annotation reporting across major NLP venues, asking which annotation details are documented, which are missing, and how reporting varies across time, topic, venue, and intended use of human judgment. We introduce a unified taxonomy of annotation-reporting practices and validate an LLM-assisted extraction pipeline against Annotated-gold, a human-adjudicated gold standard of 41 papers and 72 annotation tasks, where the best model reaches human-comparable agreement with adjudicated labels, with Krippendorff's alpha of 0.606 versus 0.585 for human-human agreement. Using this pipeline, we construct Annotated-llm, a dataset covering ACL-venue papers from 2018-2025, with 2,667 extracted annotation tasks from 1,603 papers, and find that papers frequently report operational details such as recruitment strategies, annotator expertise, and annotation volume, but often omit details needed to assess annotation validity, including training, language proficiency, compensation, socio-demographics, adjudication, and agreement values, especially in model-evaluation studies. Our results show that annotation reporting in NLP has improved over time but remains uneven, and they establish a scalable framework and bare-minimum reporting recommendations for making human annotation more reliable, reproducible, and interpretable.
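
The abstract compares model-versus-gold agreement (0.606) with human-human agreement (0.585) using Krippendorff's alpha. For readers unfamiliar with the statistic, the following is a minimal sketch of the standard nominal-data form, alpha = 1 - D_o / D_e, computed from a coincidence matrix. It is not the authors' implementation: the abstract does not say which distance metric or software was used, and the function name `krippendorff_alpha_nominal` and the toy labels are invented here for illustration.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    `units` is a list of lists: each inner list holds the labels that
    different annotators assigned to one unit, with None for a missing
    annotation. (Hypothetical helper, not taken from the paper.)
    """
    # Coincidence matrix: every ordered pair of labels from different
    # annotators within a unit contributes 1 / (m - 1), where m is the
    # number of labels available for that unit.
    coincidences = Counter()
    for labels in units:
        values = [v for v in labels if v is not None]
        m = len(values)
        if m < 2:
            continue  # a unit with fewer than two labels cannot be paired
        for a, b in permutations(values, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals n_c per category, and the grand total n.
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least one unit with two or more labels")

    # Observed disagreement (times n) and expected disagreement (times n).
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        raise ValueError("only one category in use; alpha is undefined")
    return 1.0 - observed / expected


if __name__ == "__main__":
    # Toy data, invented for illustration: three annotators label four
    # items on whether a paper reports annotator compensation.
    toy = [
        ["yes", "yes", "yes"],
        ["no", "no", None],
        ["yes", "no", "yes"],
        ["no", "no", "no"],
    ]
    print(round(krippendorff_alpha_nominal(toy), 3))
```

Values near 1 indicate agreement well above chance and values near 0 indicate chance-level agreement, which is why the abstract reads 0.606 against a human-human baseline of 0.585 as human-comparable.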

Get this paper in your agent:

hf papers read 2606.02255
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
