Published on 2026-06-02
Authors: Maurice Kraus, Ruben Härle, Sebastian Sztwiertnia, Abbas Goher Khan, Mehdi Ali, Michael Fromm, Kristian Kersting
Organization: AIML-TUDA (Artificial Intelligence & Machine Learning Lab at TU Darmstadt)
Project page: https://huggingface.co/collections/AIML-TUDA/klettermix
KletterMix: Climbing Toward High-Quality German Pretraining Data
Abstract
A high-quality German-language corpus for language model pretraining is introduced through careful translation of an English corpus while preserving document structure and metadata, demonstrating improved downstream performance in German-language tasks.
High-quality pretraining data is a central ingredient in modern language models, but German-language resources remain far less developed than their English counterparts: they are often smaller, less carefully curated, weakly documented, and rarely validated through controlled training experiments. We introduce KletterMix, a high-quality German corpus for language model pretraining and annealing, designed as a reusable dataset artifact for the natural language processing and modeling community. KletterMix is built by translating a state-of-the-art English pretraining corpus into German while preserving document boundaries, metadata, source structure, and topical diversity. This construction yields a German corpus with the scale and diversity of a modern pretraining dataset, while enabling direct comparison to its English source. We document the dataset through a broad set of corpus-level analyses, including translation quality, document length distributions, topic coverage, source composition, and geographic metadata. Using COMETKiwi, we show that the translated documents achieve strong quality across diverse domains, suggesting that careful translation can preserve much of the semantic and stylistic richness of the original corpus. Beyond dataset construction, we evaluate KletterMix as training data. Through controlled pretraining and annealing ablations against established German corpora, we show that models trained on KletterMix achieve measurable improvements on German-language downstream evaluations. These results demonstrate that carefully curated translated data can substantially strengthen the German pretraining data ecosystem.
Community
Hi @Maurice Kraus and team,
The paper looks really interesting; I still have to check the translated dataset.
After reading the paper once, I was wondering whether the translation prompt is too short and lacks filtering and instructions, compared to the translation prompt of the FineTranslations project, which can be found here: https://github.com/huggingface/finetranslations/blob/main/2_run_gemma/run_pipeline.py. In general, I am missing a reference to the FineTranslations project (see https://huggingface.co/datasets/HuggingFaceFW/finetranslations), which should definitely be added :)
Hey @stefan-it, thanks for the interest and the heads-up.
We were not aware of this dataset, as there is no accompanying paper. Yes, our prompt is indeed short because, during our initial probing and manual inspection, a longer prompt did not seem necessary. Also, compared to FineTranslations, our dataset is based on ClimbMix, a high-quality dataset that has already been filtered (using Nemotron CC and other measures).
Furthermore, we provide proxy-score-based measures to further improve data quality.
That said, we will take a closer look at this dataset and will definitely include it in a future revised version of the paper.
Bests,
Maurice + Authors
Many thanks Maurice!
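I also really like the cluster labels in the dataset. They could be exposed as subsets for users who are only interested in specific clusters. We took a similar approach with our German Commons dataset (https://huggingface.co/datasets/coral-nlp/german-commons).
I would also very much like to see dataset subsets/samples based on token counts, e.g. FineWeb offers sample-350BT and sample-10BT splits. But enough feature requests for now 😅
We will evaluate whether something like this is feasible. In the meantime, you can simply filter based on the cluster labels :)
We plan to upload at least one additional version based on our 60-proxy filtering. We'll also provide the 12B variant used in our ablation studies.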
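For readers who want to try the two ideas raised in this thread before any official subsets exist, here is a minimal sketch of filtering on cluster labels and then cutting a token-budgeted sample. It runs on plain Python dictionaries rather than the real dataset. The field names `cluster_label` and `token_count` and the label values are assumptions made for illustration, not the actual KletterMix schema, so check the dataset card for the real column names.

```python
from typing import Dict, Iterable, Iterator, List, Set

Record = Dict[str, object]


def filter_by_cluster(
    records: Iterable[Record],
    wanted: Set[str],
    label_key: str = "cluster_label",  # hypothetical column name
) -> Iterator[Record]:
    """Yield only the records whose cluster label is in `wanted`."""
    for rec in records:
        if rec.get(label_key) in wanted:
            yield rec


def take_token_budget(
    records: Iterable[Record],
    budget: int,
    count_key: str = "token_count",  # hypothetical column name
) -> List[Record]:
    """Collect records in order and stop before the budget would be exceeded.

    This takes a prefix of the stream; shuffle upstream if a random
    sample is wanted.
    """
    sample: List[Record] = []
    used = 0
    for rec in records:
        n = int(rec[count_key])
        if used + n > budget:
            break
        sample.append(rec)
        used += n
    return sample


# Toy records standing in for dataset rows (all values are made up).
docs = [
    {"text": "a", "cluster_label": "science", "token_count": 40},
    {"text": "b", "cluster_label": "sports", "token_count": 30},
    {"text": "c", "cluster_label": "science", "token_count": 50},
    {"text": "d", "cluster_label": "science", "token_count": 20},
]

science_only = filter_by_cluster(docs, {"science"})
sample = take_token_budget(science_only, budget=100)
```

Because both helpers work on any iterable, the same pattern should carry over to a streamed dataset without loading everything into memory, provided the rows expose a label and a token count under some column name.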
arXiv: arxiv.org/abs/2606.03773