dots.tts Technical Report
Authors: Shi Lian, Changtao Li, Bohan Li, Hankun Wang, Da Zheng, Junfeng Tian, Yufeng Ma, Colin Zhang, Kai Yu
Published: 2026-06-05 (arXiv:2606.07080)
Code: https://github.com/rednote-hilab/dots.tts
Abstract
A 2B-parameter continuous autoregressive text-to-speech model trained on a multilingual corpus achieves state-of-the-art performance on multiple benchmarks while enabling efficient low-latency speech generation through specialized distillation techniques.
We present dots.tts, a 2B-parameter continuous autoregressive text-to-speech (TTS) foundation model that models speech in a continuous latent space. Compared with existing continuous autoregressive models, our key innovations are threefold. First, we train an AudioVAE with multiple objectives to build a semantically structured and prediction-friendly continuous speech space. Second, we use full-history conditioning in the flow-matching head to preserve long-range consistency and reduce drift during generation. Third, we apply reward-free self-corrective post-training to the flow-matching head to further improve robustness and acoustic quality. After being trained on a large-scale multilingual corpus, dots.tts achieves the best average performance on Seed-TTS-Eval, with WERs of 0.94%/1.30%/6.60% and SIM scores of 81.0/77.1/79.5 on the zh/en/zh-hard test sets, respectively. Across other benchmarks, dots.tts also consistently demonstrates open-source state-of-the-art performance, exhibiting strong generation stability, voice cloning ability, and emotional expressiveness. For efficient inference, we further apply CFG-aware MeanFlow distillation, enabling low-latency speech generation with first-packet latencies of 85/54 ms in output streaming and dual-streaming modes, respectively. To facilitate reproducible research and practical deployment, we release the training and inference code, together with the pretrained, post-trained, and MeanFlow-distilled checkpoints, under the Apache 2.0 license.
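The WER figures quoted above are word error rates. As a minimal sketch of the metric itself, the function below computes the token-level edit distance between a reference transcript and a hypothesis, divided by the reference length. The function name `word_error_rate` and the whitespace tokenization are illustrative assumptions, not taken from the report; the actual Seed-TTS-Eval pipeline also involves transcribing the generated audio with an ASR model and normalizing the text, and Chinese text is often scored on characters rather than whitespace-separated words. None of that is shown here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / reference length.

    Assumption: tokens are obtained by splitting on whitespace. A real
    evaluation would normalize text first and may tokenize differently.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one token")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis tokens.
    prev = list(range(len(hyp) + 1))
    for i, ref_tok in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_tok in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_tok == hyp_tok else 1)
            deletion = prev[j] + 1        # reference token missing from hypothesis
            insertion = curr[j - 1] + 1   # extra token in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

Under this definition, a reported WER of 0.94% corresponds to fewer than one token error per hundred reference tokens, averaged over the test set.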