r/LocalLLaMA · May 22, 2026 · 1 min read

I fine-tuned Cohere Transcribe to support diarization and timestamps

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

I'll keep it short:
Cohere Transcribe is currently the best open-source speech-to-text model (and possibly even better than some proprietary models).

BUT it doesn't support diarization (speaker identification) or timestamps, even though the tokenizer already has tokens for both.

SO I trained the model to support both. It follows the standard timestamp convention.

The output now looks like this:

<|spltoken0|><|t:0.0|> Welcome back. <|t:1.5|><|spltoken1|><|t:1.5|> Thanks. <|t:2.4|>

This format is easy to parse.
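For anyone who wants to consume this output, here is a minimal parsing sketch in Python. It assumes every segment has the exact shape of the sample above (a speaker token, a start timestamp, the text, then an end timestamp); that layout is inferred from the one example line, not from a spec. The names `Segment` and `parse_segments` are made up for illustration and are not part of the model or its scripts.

```python
import re
from typing import List, NamedTuple


class Segment(NamedTuple):
    speaker: int   # index taken from <|spltokenN|>
    start: float   # seconds, from the first <|t:...|>
    end: float     # seconds, from the second <|t:...|>
    text: str


# Assumed layout per segment: speaker token, start time, text, end time.
SEGMENT_RE = re.compile(
    r"<\|spltoken(\d+)\|>"       # speaker index
    r"<\|t:(\d+(?:\.\d+)?)\|>"   # start time in seconds
    r"(.*?)"                     # transcript text (non-greedy)
    r"<\|t:(\d+(?:\.\d+)?)\|>",  # end time in seconds
    re.DOTALL,
)


def parse_segments(output: str) -> List[Segment]:
    """Turn the tagged model output into a list of Segment tuples."""
    return [
        Segment(
            speaker=int(m.group(1)),
            start=float(m.group(2)),
            end=float(m.group(4)),
            text=m.group(3).strip(),
        )
        for m in SEGMENT_RE.finditer(output)
    ]


sample = (
    "<|spltoken0|><|t:0.0|> Welcome back. <|t:1.5|>"
    "<|spltoken1|><|t:1.5|> Thanks. <|t:2.4|>"
)

for seg in parse_segments(sample):
    print(f"[{seg.start:.1f}-{seg.end:.1f}] speaker {seg.speaker}: {seg.text}")
```

If the model ever emits several timestamped spans under one speaker token, this regex would skip the extra spans, so treat it as a starting point.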

The timestamps are accurate within 0.097 seconds on average, and 90% are within 0.006 seconds.

The model supports up to 4 speakers per 30 seconds of audio, and with the diarize_long.py script, it could accurately identify up to 32 people.
