Hugging Face Daily Papers · June 22, 2026 · 3 min read

Characterizing Narrative Content in Web-scale LLM Pretraining Data

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.


arxiv:2606.19468

Characterizing Narrative Content in Web-scale LLM Pretraining Data

Published on Jun 17 · Submitted by Teagan Johnson on Jun 22

Culture, Language, & Systems Lab

Authors: Teagan Johnson, Elliott Ash, Andrew Piper, Maria Antoniak

Abstract

A comprehensive analysis of narrative structures in large-scale language model training data reveals measurable, multidimensional narrative patterns that vary across different content sources and topics.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

The narrative composition of web-scale LLM pretraining corpora remains largely unexplored even though narrative is a fundamental mode of human communication. We present the first fine-grained study of narrative features in Dolma, a 3-trillion-token open pretraining corpus. Drawing on narrative theory, we design a framework spanning three core narrative elements (agency, setting, and events) operationalized as 11 interpretable dimensions. After sampling and annotating a diverse set of 400 passages, we finetune and validate NarraBERT, a RoBERTa-based model for fine-grained narrative prediction. We apply NarraBERT to 3M passages, resulting in a new dataset, NarraDolma. We find that (i) narrative structure is measurable at scale across extremely heterogeneous data, (ii) a continuous, multidimensional narrative structure underlies web text, and (iii) narrative qualities are unequally distributed across pretraining sources and topics in ways that current curation practices neither measure nor account for. Our framework, dataset, and analyses provide a foundation for understanding how narrative qualities are distributed in LLM pretraining data and for studying how data composition affects narrative reasoning tasks. We publicly release NarraDolma and NarraBERT.
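To make finding (iii) concrete, here is a minimal sketch of how per-passage narrative scores could be averaged by pretraining source. It is not the authors' code: the record schema, the source labels (`forum`, `code`), and the score values are all hypothetical, and only the three element names come from the abstract.

```python
from collections import defaultdict
from statistics import mean

# The three narrative elements named in the abstract. The paper's 11
# finer-grained dimensions are not listed on this page, so they are not modeled.
ELEMENTS = ("agency", "setting", "events")


def mean_scores_by_source(passages):
    """Average each element score within each source.

    `passages` is an iterable of dicts with a "source" key and a "scores"
    dict. This schema is an assumption, not NarraDolma's actual format.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for passage in passages:
        for element in ELEMENTS:
            grouped[passage["source"]][element].append(passage["scores"][element])
    return {
        source: {element: mean(values) for element, values in by_element.items()}
        for source, by_element in grouped.items()
    }


# Made-up toy records for illustration only; these are not results from the paper.
toy_passages = [
    {"source": "forum", "scores": {"agency": 0.8, "setting": 0.4, "events": 0.6}},
    {"source": "forum", "scores": {"agency": 0.6, "setting": 0.2, "events": 0.8}},
    {"source": "code", "scores": {"agency": 0.1, "setting": 0.0, "events": 0.1}},
]

summary = mean_scores_by_source(toy_passages)
for source, scores in sorted(summary.items()):
    print(source, {name: round(value, 2) for name, value in scores.items()})
```

Comparing the per-source averages is the simplest way to see whether narrative qualities are spread unevenly. A fuller analysis would compare whole distributions, not just means.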

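Project page: https://huggingface.co/collections/teagrjohnson/narratives-in-llm-pretraining-data
GitHub: https://github.com/johnsont4/narratives_in_pretraining_data_release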

Community

teagrjohnson (paper author, paper submitter) · about 5 hours ago

Where do narratives live in pretraining data? Check out this paper to find out!

Models citing this paper 2

Datasets citing this paper 3

Spaces citing this paper 1

Collections including this paper 1
