r/LocalLLaMA · · 1 min read

I fine-tuned Cohere Transcribe to support diarization and timestamps


Hi

I'll keep it short:
Cohere-transcribe is currently the best open-source speech-to-text model (and possibly even better than some proprietary models).

BUT it doesn't support diarization (speaker identification) or timestamps, even though the tokenizer already has tokens for them.

SO I trained the model to support both. The output follows the standard timestamp format.

The output now looks like this:

<|spltoken0|><|t:0.0|> Welcome back. <|t:1.5|><|spltoken1|><|t:1.5|> Thanks. <|t:2.4|> 

This format is easy to parse.
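To show how little it takes to parse, here is a minimal Python sketch. It is not part of the released model or its scripts: the token shapes (`<|spltokenN|>` for a speaker, `<|t:SECONDS|>` for a timestamp) are inferred only from the sample line above, and the names `Segment` and `parse_transcript` are hypothetical. It also assumes each piece of text sits between a start and an end timestamp, and that a speaker token applies until the next one appears.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumed token shapes, taken only from the sample output in the post:
#   <|spltokenN|>  -> speaker N
#   <|t:SECONDS|>  -> timestamp in seconds
TOKEN_RE = re.compile(
    r"<\|(?:spltoken(?P<speaker>\d+)|t:(?P<time>\d+(?:\.\d+)?))\|>"
)


@dataclass
class Segment:
    speaker: Optional[int]  # None if no speaker token was seen yet
    start: float
    end: float
    text: str


def parse_transcript(raw: str) -> list[Segment]:
    """Split the tagged model output into speaker/time/text segments."""
    segments: list[Segment] = []
    speaker: Optional[int] = None
    start: Optional[float] = None
    pos = 0  # end of the previous token in `raw`

    for match in TOKEN_RE.finditer(raw):
        # Plain text sitting between the previous token and this one.
        text = raw[pos:match.start()].strip()
        pos = match.end()

        if match.group("speaker") is not None:
            # Speaker token: remember it for the segments that follow.
            speaker = int(match.group("speaker"))
            continue

        t = float(match.group("time"))
        if start is not None and text:
            # Closing timestamp: emit the finished segment.
            segments.append(Segment(speaker, start, t, text))
            start = None
        else:
            # Opening timestamp of the next segment.
            start = t

    return segments


sample = (
    "<|spltoken0|><|t:0.0|> Welcome back. <|t:1.5|>"
    "<|spltoken1|><|t:1.5|> Thanks. <|t:2.4|> "
)
for seg in parse_transcript(sample):
    print(f"speaker {seg.speaker} [{seg.start:.1f}s to {seg.end:.1f}s]: {seg.text}")
```

If the real output deviates from these assumptions (for example, several timestamps inside one speaker turn), the loop would need adjusting.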

The timestamps are accurate within 0.097 seconds on average, and 90% are within 0.006 seconds.
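The post doesn't say how these figures were computed. As a rough illustration only, the sketch below shows one common way to get this kind of number: the mean absolute error between predicted and reference timestamps, plus the share of timestamps that fall within a tolerance. The function name `timestamp_error_stats` and the toy values are hypothetical and are not the evaluation data behind the figures above.

```python
from statistics import mean


def timestamp_error_stats(predicted, reference, tolerance):
    """Return (mean absolute error, fraction of errors <= tolerance).

    `predicted` and `reference` are equal-length lists of timestamps in
    seconds, already aligned one-to-one (an assumption of this sketch).
    """
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("need two non-empty, equal-length timestamp lists")

    errors = [abs(p - r) for p, r in zip(predicted, reference)]
    within = sum(e <= tolerance for e in errors) / len(errors)
    return mean(errors), within


# Toy values for illustration only.
predicted = [0.0, 1.5, 2.5, 4.0]
reference = [0.0, 1.5, 2.4, 3.0]
mae, share = timestamp_error_stats(predicted, reference, tolerance=0.2)
print(f"mean abs error: {mae:.3f}s, within tolerance: {share:.0%}")
```

A few large misses can pull the mean well above the error seen on most timestamps, which is why both an average and a percentile-style figure are worth reporting.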

The model supports up to 4 speakers per 30 seconds of audio, and with the diarize_long.py script it can accurately identify up to 32 people.

It's available for free on Hugging Face.

Enjoy!

submitted by /u/iamMess
