Hugging Face Daily Papers · June 3, 2026 · 8 min read

KletterMix: Climbing Toward High-Quality German Pretraining Data

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2606.03773

Published on Jun 2

Submitted by Ruben Härle on Jun 4 · Artificial Intelligence & Machine Learning Lab at TU Darmstadt

Authors: Maurice Kraus, Ruben Härle, Sebastian Sztwiertnia, Abbas Goher Khan, Mehdi Ali, Michael Fromm, Kristian Kersting

Abstract

A high-quality German-language corpus for language model pretraining is introduced through careful translation of an English corpus while preserving document structure and metadata, demonstrating improved downstream performance in German-language tasks.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

High-quality pretraining data is a central ingredient in modern language models, but German-language resources remain far less developed than their English counterparts: they are often smaller, less carefully curated, weakly documented, and rarely validated through controlled training experiments. We introduce KletterMix, a high-quality German corpus for language model pretraining and annealing, designed as a reusable dataset artifact for the natural language processing and modeling community. KletterMix is built by translating a state-of-the-art English pretraining corpus into German while preserving document boundaries, metadata, source structure, and topical diversity. This construction yields a German corpus with the scale and diversity of a modern pretraining dataset, while enabling direct comparison to its English source. We document the dataset through a broad set of corpus-level analyses, including translation quality, document length distributions, topic coverage, source composition, and geographic metadata. Using COMETKiwi, we show that the translated documents achieve strong quality across diverse domains, suggesting that careful translation can preserve much of the semantic and stylistic richness of the original corpus. Beyond dataset construction, we evaluate KletterMix as training data. Through controlled pretraining and annealing ablations against established German corpora, we show that models trained on KletterMix achieve measurable improvements on German-language downstream evaluations. These results demonstrate that carefully curated translated data can substantially strengthen the German pretraining data ecosystem.
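The construction described in the abstract (one German document per English document, with metadata carried over unchanged) can be sketched roughly as follows. This is a minimal illustration and not the authors' pipeline: the record fields id, text, and meta, the function translate_corpus, and the toy_translate stand-in are all assumptions made for the example. In practice the translate callable would wrap an LLM or MT system.

```python
from typing import Callable, Iterable, Iterator


def translate_corpus(
    docs: Iterable[dict],
    translate: Callable[[str], str],
) -> Iterator[dict]:
    """Yield exactly one translated record per input record.

    Assumed (hypothetical) schema: {"id": ..., "text": ..., "meta": {...}}.
    """
    for doc in docs:
        yield {
            "id": doc["id"],                    # same ID keeps source and translation aligned
            "text": translate(doc["text"]),     # whole document in, whole document out
            "meta": dict(doc.get("meta", {})),  # metadata is copied, not modified
        }


# Toy stand-in for a real translation model (illustration only).
def toy_translate(text: str) -> str:
    return text.replace("Hello", "Hallo")


english = [
    {"id": "doc-0", "text": "Hello world", "meta": {"cluster": 3}},
    {"id": "doc-1", "text": "Hello again", "meta": {"cluster": 7}},
]
german = list(translate_corpus(english, toy_translate))
```

Because IDs and metadata pass through untouched, each German record can be joined back to its English source, which is what makes the direct comparison mentioned in the abstract possible.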

arXiv: https://arxiv.org/abs/2606.03773 · Project page: https://huggingface.co/collections/AIML-TUDA/klettermix

Community

stefan-it

1 day ago

Hi @Maurice Kraus and team,

the paper looks really interesting, I have to check the translated dataset.

After reading the paper once, I was wondering whether the translation prompt is too short and lacks filtering steps and instructions compared to the translation prompt of the FineTranslations project, which can be found here: https://github.com/huggingface/finetranslations/blob/main/2_run_gemma/run_pipeline.py. In general, I am missing a reference to the FineTranslations project (see https://huggingface.co/datasets/HuggingFaceFW/finetranslations), which should definitely be added :)

mkrausio

Paper author · 1 day ago · edited 1 day ago

Hey @stefan-it, thanks for the interest and the heads-up.

We were not aware of this dataset, as there is no accompanying paper. Yes, our prompt is indeed short because, during our initial probing and manual inspection, a longer prompt did not seem necessary. Also, compared to FineTranslations, our dataset is based on ClimbMix, a high-quality dataset that has already been filtered (using Nemotron CC and other measures).

Furthermore, we provide proxy-score-based measures to further improve data quality.

That said, we will take a closer look at this dataset and will definitely include it in a future revised version of the paper.

Bests,
Maurice + Authors

stefan-it

1 day ago

Many thanks Maurice!

I also really like the cluster labels in the dataset. Maybe they could be exposed as subsets for users who are only interested in specific clusters. We took a similar approach with our German Commons dataset (https://huggingface.co/datasets/coral-nlp/german-commons).

I would also really like to see dataset subsets/samples based on token counts, e.g. FineWeb offers sample-350BT and sample-10BT splits. But enough feature requests for now 😅
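Both requests in this comment (per-cluster subsets and token-budget samples in the style of FineWeb's sample-10BT) can be approximated on the client side. The following is a minimal sketch, assuming each record carries a cluster label and a text field; these field names and both helper functions are hypothetical, so check the dataset card for the real schema. A whitespace split is used as a crude stand-in for a real tokenizer.

```python
import random
from typing import Iterable, Iterator


def filter_by_cluster(records: Iterable[dict], wanted: set) -> Iterator[dict]:
    """Keep only records whose (assumed) `cluster` field is in `wanted`."""
    return (r for r in records if r.get("cluster") in wanted)


def sample_token_budget(records: Iterable[dict], budget: int, seed: int = 0) -> list:
    """Shuffle records, then greedily add each one that still fits the budget.

    Token counts are approximated by whitespace splitting; swap in a real
    tokenizer for meaningful numbers.
    """
    pool = list(records)
    random.Random(seed).shuffle(pool)  # seeded shuffle, so the sample is reproducible
    sample, used = [], 0
    for r in pool:
        n = len(r["text"].split())
        if used + n > budget:
            continue  # too large for the remaining budget; a later, smaller one may fit
        sample.append(r)
        used += n
    return sample


corpus = [
    {"text": "eins zwei drei", "cluster": 1},
    {"text": "vier fünf", "cluster": 2},
    {"text": "sechs sieben acht neun", "cluster": 1},
    {"text": "zehn", "cluster": 3},
]
cluster_one = list(filter_by_cluster(corpus, {1}))
subset = sample_token_budget(corpus, budget=5)
total_tokens = sum(len(r["text"].split()) for r in subset)
```

The fixed seed is a deliberate choice: a published sample split is only useful if everyone draws the same documents. For a corpus that does not fit in memory, the same idea would need a streaming or reservoir-style variant rather than a full shuffle.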

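mkrausio

Paper author · Jun 4 · edited

We will evaluate whether something like this is feasible. In the meantime, you can simply filter based on the cluster labels :)

We plan to upload at least one additional version based on our 60-proxy filtering. We'll also provide the 12B variant used in our ablation studies.

RuHae

Paper author · Paper submitter · about 11 hours ago

[image: https://cdn-uploads.huggingface.co/production/uploads/65b36f38638328850ebda93d/bltCNBpZRmwI_V-EczY6L.png]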

Get this paper in your agent:

hf papers read 2606.03773

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper: 0 · Datasets citing this paper: 3 · Spaces citing this paper: 0 · Collections including this paper: 1
