I fine-tuned Cohere Transcribe to support diarization and timestamps
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
Hi
I'll keep it short:
Cohere-transcribe is currently the best open-source speech-to-text model (and possibly even better than some proprietary models).
BUT it doesn't support diarization (speaker identification) and timestamps, even though there are tokens for it in the tokenizer.
SO I trained the model to support both. The output follows the standard timestamp format.
The output now looks like this:
<|spltoken0|><|t:0.0|> Welcome back. <|t:1.5|><|spltoken1|><|t:1.5|> Thanks. <|t:2.4|>
This is an easily parsable format.
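To show what parsing this output might look like, here is a minimal Python sketch. It assumes the layout seen in the sample above: a `<|spltokenN|>` token sets the current speaker, and each piece of text sits between a start and an end `<|t:…|>` token. The names `Segment` and `parse_transcript` are made up for illustration and are not part of the released model or its scripts.

```python
import re
from typing import List, NamedTuple, Optional


class Segment(NamedTuple):
    speaker: int   # N from <|spltokenN|>
    start: float   # seconds, from the timestamp token before the text
    end: float     # seconds, from the timestamp token after the text
    text: str


# Matches either a speaker token or a timestamp token.
TOKEN = re.compile(r"<\|(?:spltoken(\d+)|t:(\d+(?:\.\d+)?))\|>")


def parse_transcript(raw: str) -> List[Segment]:
    """Split model output into (speaker, start, end, text) segments.

    Assumption: text is always bracketed by two timestamp tokens and
    preceded somewhere by a speaker token, as in the sample output.
    Text that does not fit this pattern is skipped.
    """
    segments: List[Segment] = []
    speaker: Optional[int] = None
    start: Optional[float] = None
    pos = 0
    for match in TOKEN.finditer(raw):
        text = raw[pos:match.start()].strip()
        pos = match.end()
        if match.group(1) is not None:
            # Speaker change: the next timestamp opens a new segment.
            speaker = int(match.group(1))
            start = None
        else:
            stamp = float(match.group(2))
            if text and start is not None and speaker is not None:
                segments.append(Segment(speaker, start, stamp, text))
            # This timestamp may also open the following segment.
            start = stamp
    return segments


sample = ("<|spltoken0|><|t:0.0|> Welcome back. <|t:1.5|>"
          "<|spltoken1|><|t:1.5|> Thanks. <|t:2.4|>")
for seg in parse_transcript(sample):
    print(f"[{seg.start:.1f}-{seg.end:.1f}] speaker {seg.speaker}: {seg.text}")
```

A single regex pass keeps the parser independent of the tokenizer, so it works on the decoded string alone.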
The timestamps are accurate within 0.097 seconds on average, and 90% are within 0.006 seconds.
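The post doesn't say how these two figures were measured. One plausible reading is a mean absolute error alongside a 90th-percentile absolute error, and the sketch below shows how such numbers could be computed. The function name `timestamp_error_stats` and the toy data are invented for illustration; they are not the author's evaluation code or results.

```python
import math
from typing import Sequence, Tuple


def timestamp_error_stats(
    predicted: Sequence[float],
    reference: Sequence[float],
    quantile: float = 0.9,
) -> Tuple[float, float]:
    """Return (mean absolute error, nearest-rank quantile of absolute error).

    Assumes predicted[i] and reference[i] refer to the same boundary.
    """
    if not predicted or len(predicted) != len(reference):
        raise ValueError("need two non-empty sequences of equal length")
    errors = sorted(abs(p - r) for p, r in zip(predicted, reference))
    mean_error = sum(errors) / len(errors)
    # Nearest-rank quantile: smallest error that covers `quantile` of the data.
    rank = max(1, math.ceil(quantile * len(errors)))
    return mean_error, errors[rank - 1]


# Toy data (invented): nine exact timestamps and one that is off by a second.
reference = [0.0] * 10
predicted = [0.0] * 9 + [1.0]
mean_error, p90_error = timestamp_error_stats(predicted, reference)
print(f"mean={mean_error:.3f}s, 90% within {p90_error:.3f}s")
```

The toy data also shows why a mean can be much larger than the 90% bound: a few large misses pull the average up while most timestamps stay very close.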
The model supports up to 4 speakers per 30 seconds of audio, and with the diarize_long.py script it can accurately identify up to 32 people.
It's available for free on Hugging Face.
Enjoy!