I built Derpy Turtle: The Kokoro Trainer, a GUI for training better Kokoro voices with RVC
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
I’ve been working on a tool called Derpy Turtle: The Kokoro Trainer. It started as a random-walk experiment for Kokoro voices, but it has grown into its own thing: a Windows GUI for creating better local voice outputs by combining Kokoro voice search with RVC voice conversion.

The short version: Kokoro is good at generating speech. RVC is good at matching a target voice. Derpy Turtle connects the two.

The app lets you:

- Load a target voice clip.
- Search/refine Kokoro `.pt` voices against that target.
- Train an RVC model from your target audio.
- Generate Kokoro speech.
- Automatically pass the output through your trained RVC model.
- Save the final converted `_rvc.wav`.

The important lesson I learned is that chasing a super high Kokoro similarity score alone is not enough. I was stuck around the low/mid 80% range even after very long runs. The output improved, but it still did not sound close enough. The better approach was to use Kokoro as the clean speech source, then let RVC handle the final voice identity.

So the current workflow is:
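A minimal Python sketch of that flow, following the order of the feature list above. Everything named here is an assumption for illustration: `run_workflow`, `rvc_output_path`, and the four stage callables are hypothetical placeholders, not Derpy Turtle's actual API and not real Kokoro or RVC calls. The only detail taken from the post is that the final file is saved with an `_rvc.wav` suffix.

```python
from pathlib import Path
from typing import Callable


def rvc_output_path(kokoro_wav: str) -> Path:
    """Derive the converted file name by appending `_rvc` to the stem."""
    p = Path(kokoro_wav)
    return p.with_name(f"{p.stem}_rvc.wav")


def run_workflow(
    text: str,
    target_clip: str,
    search_voice: Callable[[str], str],         # hypothetical: target clip -> Kokoro .pt voice
    train_rvc: Callable[[str], str],            # hypothetical: target clip -> RVC model path
    synthesize: Callable[[str, str], str],      # hypothetical: (text, voice) -> raw wav path
    convert: Callable[[str, str, Path], None],  # hypothetical: (raw wav, model, output path)
) -> Path:
    """Chain the stages: Kokoro supplies clean speech, RVC supplies identity."""
    voice = search_voice(target_clip)   # 1. search/refine a Kokoro voice against the target
    model = train_rvc(target_clip)      # 2. train an RVC model from the same target audio
    raw_wav = synthesize(text, voice)   # 3. generate clean speech with Kokoro
    final = rvc_output_path(raw_wav)    # 4. decide where the converted file goes
    convert(raw_wav, model, final)      # 5. pass the Kokoro output through RVC
    return final


if __name__ == "__main__":
    # Stub stages so the sketch runs without Kokoro or RVC installed.
    out = run_workflow(
        "Hello there.",
        "target.wav",
        search_voice=lambda clip: "best_voice.pt",
        train_rvc=lambda clip: "target_model.pth",
        synthesize=lambda text, voice: "hello.wav",
        convert=lambda raw, model, dst: None,
    )
    print(out)
```

Passing the stages in as callables keeps the sketch independent of any particular Kokoro or RVC install; in the real tool the GUI handles all of this wiring.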
The GUI has presets, queue management, ETA logging, extra target audio support, per-audio transcript mapping, CUDA support, and a launcher `.exe` that handles first-time setup.

A few practical notes:

- You need clean training audio. A smaller clean dataset beats a larger noisy one.
- RVC helps with timbre/identity, but it does not magically fix bad pacing or pronunciation.
- The Kokoro similarity score is pre-RVC, so the final converted audio can sound much better even if the score does not change.
- CUDA makes a huge difference. On my RTX 3060, GPU mode cut one run from roughly 26 hours on CPU to about 4 hours.

It's 100% free for non-commercial use. Personal/research use is allowed, but anyone wanting commercial use would need to contact me.

The goal is to make local voice experimentation more accessible. I made everything as user-friendly as possible. I wanted something where a non-technical user could run an `.exe`, load target audio, train/refine, and actually get usable output without manually wiring together a bunch of tools.

I've added this process to my game here, if anyone wants to experience it in practice. All the voices are trained using this trainer. Enjoy!