r/LocalLLaMA · May 12, 2026 · 2 min read

I built Derpy Turtle: The Kokoro Trainer, a GUI for training better Kokoro voices with RVC

I’ve been working on a tool called Derpy Turtle: The Kokoro Trainer. It started as a random-walk experiment for Kokoro voices, but it has grown into its own thing: a Windows GUI for creating better local voice outputs by combining Kokoro voice search with RVC voice conversion.

The short version:

Kokoro is good at generating speech. RVC is good at matching a target voice. Derpy Turtle connects the two.

The app lets you:

- Load a target voice clip.

- Search/refine Kokoro `.pt` voices against that target.

- Train an RVC model from your target audio.

- Generate Kokoro speech.

- Automatically pass the output through your trained RVC model.

- Save the final converted `_rvc.wav`.
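As a rough illustration of the search/refine step in the list above, here is a minimal random-walk sketch. It is not the tool's actual code: the 8-dimensional vectors are toy stand-ins for a Kokoro voice tensor and a target, the names `cosine_similarity` and `random_walk_search` are hypothetical, and scoring by direct cosine similarity is an assumption (a real score would presumably compare generated audio against the target clip).

```python
import math
import random


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def random_walk_search(start, target, steps=2000, step_size=0.05, seed=0):
    """Random walk with hill climbing: perturb the candidate with Gaussian
    noise and keep the new candidate only when its score improves."""
    rng = random.Random(seed)
    best = list(start)
    best_score = cosine_similarity(best, target)
    for _ in range(steps):
        candidate = [x + rng.gauss(0.0, step_size) for x in best]
        score = cosine_similarity(candidate, target)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score


# Toy data (assumption): random vectors standing in for voice embeddings.
rng = random.Random(1)
target = [rng.uniform(-1, 1) for _ in range(8)]
start = [rng.uniform(-1, 1) for _ in range(8)]

start_score = cosine_similarity(start, target)
best, best_score = random_walk_search(start, target)
print(f"start {start_score:.3f} -> best {best_score:.3f}")
```

Because only improvements are accepted, the score never goes down, but nothing guarantees it keeps climbing. A search like this can level off well short of a perfect match.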

The important lesson I learned is that chasing a super high Kokoro similarity score alone is not enough. I was stuck around the low/mid 80% range even after very long runs. The output improved, but it still did not sound close enough. The better approach was to use Kokoro as the clean speech source, then let RVC handle the final voice identity.

So the current workflow is:

1. Train an RVC model from clean target audio.
2. Run a short Kokoro search/refinement to get stable speech.
3. Enable “Use Latest RVC”.
4. Generate the line.
5. Listen to the `_rvc.wav`, not just the optimizer score.
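The generate-then-convert part of that workflow can be sketched as a small pipeline. This is only an outline under stated assumptions: `generate_line`, `rvc_output_path`, and the stub callables are hypothetical names, not the app's API, and the real Kokoro and RVC calls are replaced by placeholders that just record what would run.

```python
from pathlib import Path


def rvc_output_path(kokoro_wav: Path) -> Path:
    """Derive the converted file name, e.g. line01.wav -> line01_rvc.wav."""
    return kokoro_wav.with_name(f"{kokoro_wav.stem}_rvc{kokoro_wav.suffix}")


def generate_line(text, out_wav, synthesize, convert=None):
    """Synthesize speech, then optionally pass it through voice conversion.

    synthesize(text, wav_path) and convert(src_path, dst_path) are injected
    callables standing in for the Kokoro and RVC steps (assumption).
    """
    synthesize(text, out_wav)
    if convert is None:  # conversion disabled: return the raw Kokoro output
        return out_wav
    final = rvc_output_path(out_wav)
    convert(out_wav, final)
    return final


# Stub steps that only log the order of operations.
calls = []


def fake_synthesize(text, wav):
    calls.append(("kokoro", wav.name))


def fake_convert(src, dst):
    calls.append(("rvc", dst.name))


final = generate_line("Hello there.", Path("out/line01.wav"),
                      fake_synthesize, fake_convert)
print(final.name, calls)
```

Keeping the conversion step optional mirrors the idea of a toggle: with `convert=None` the function returns the plain Kokoro file, which makes it easy to compare the pre-RVC and post-RVC audio by ear.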

The GUI has presets, queue management, ETA logging, extra target audio support, per-audio transcript mapping, CUDA support, and a launcher `.exe` that handles first-time setup.

A few practical notes:

- You need clean training audio. A smaller clean dataset beats a larger noisy one.

- RVC helps with timbre/identity, but it does not magically fix bad pacing or pronunciation.

- The Kokoro similarity score is pre-RVC, so the final converted audio can sound much better even if the score does not change.

- CUDA makes a huge difference. On my RTX 3060, GPU mode cut one run from roughly 26 hours on CPU to about 4 hours.
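For a sense of scale, the CPU and GPU timings quoted above work out as follows. Both inputs are the approximate figures from that single run, so treat the result as a ballpark, not a benchmark.

```python
# Approximate timings from the one run mentioned above.
cpu_hours = 26.0
gpu_hours = 4.0

speedup = cpu_hours / gpu_hours      # how many times faster the GPU run was
hours_saved = cpu_hours - gpu_hours  # wall-clock time saved on that run

print(f"{speedup:.1f}x faster, {hours_saved:.0f} hours saved")
# prints "6.5x faster, 22 hours saved"
```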

It's 100% free for non-commercial use. Personal/research use is allowed, but anyone wanting commercial use would need to contact me.

The goal is to make local voice experimentation more accessible, so I made everything as user-friendly as possible. I wanted something where a non-technical user could run an `.exe`, load target audio, train/refine, and get usable output without manually wiring together a bunch of tools.

I've added this process to my game here, if anyone wants to experience it in practice. All the voices are trained using this trainer.

Enjoy!

submitted by /u/Great-Investigator30