noisekit - CLI for generating realistic degraded speech datasets for ASR benchmarking [P]
Mirrored from r/MachineLearning for archival readability. Support the source by reading on the original site.
If you've ever tried to pick an STT vendor for a phone-based voice agent or call center product, you've probably hit this wall: you have plenty of real production audio, but it's unlabeled, so you can't compute WER on it. And the annotated public datasets (FLEURS, CommonVoice, LibriSpeech) are clean recordings that tell you very little about how STT models actually handle your noisy, G.711-encoded phone calls.
Annotating production audio is slow, expensive, and usually a privacy headache. So most teams end up benchmarking on clean data, picking a vendor, then discovering in prod which one actually survives noise.
noisekit fills that gap. Take a clean annotated dataset, apply degradations that approximate your production conditions, and you end up with a noisy annotated corpus you can run WER on across every STT candidate.
    uvx noisekit generate \
      --dataset google/fleurs --config en_us --split test \
      --samples 100 \
      --output ./noisy-fleurs

Feed ./noisy-fleurs through each STT candidate, normalize, and compute WER with the existing transcripts. The output is HuggingFace AudioFolder-compatible, so load_dataset("audiofolder", data_dir="./noisy-fleurs") works.
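For the scoring step, here is a minimal sketch of corpus-level WER in plain Python. It is not part of noisekit: the function names are made up for illustration, and the normalization (lowercase, strip punctuation) is a deliberately crude assumption that you would replace with whatever normalizer fits your language and vendor output. A library such as jiwer does the same job more thoroughly.

```python
import re


def normalize(text: str) -> list[str]:
    """Crude normalizer (an assumption): lowercase, drop punctuation, split on whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)
    return text.split()


def word_errors(ref_words: list[str], hyp_words: list[str]) -> int:
    """Word-level Levenshtein distance: substitutions + deletions + insertions."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def corpus_wer(references: list[str], hypotheses: list[str]) -> float:
    """Total word errors divided by total reference words across the corpus."""
    errors = 0
    total = 0
    for ref, hyp in zip(references, hypotheses):
        ref_words = normalize(ref)
        errors += word_errors(ref_words, normalize(hyp))
        total += len(ref_words)
    if total == 0:
        raise ValueError("no reference words to score against")
    return errors / total
```

Pooling errors over the whole corpus, rather than averaging per-utterance WER, keeps short utterances from dominating the score.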
Presets cover the conditions that actually matter for voice products:
- telecom: G.711 narrowband bandpass + 8-bit BitCrush + 16-32 kbps MP3 (sounds like a real phone call, not a synthetic low-pass filter)
- noise: real ambient mixed at 5-15 dB SNR (auto-downloads a MUSAN noise-only subset, or bring your own --noise-dir matching your domain: call center, cafe, car, street)
- reverb: pyroomacoustics far-field at 1-3 m mic distance
- low_bitrate: wideband MP3 at 16-32 kbps
- clipping: ADC / mic saturation
- clean_reference: control / WER floor
- compound chains stack realistically. noise_telecom = noisy room then phone codec, which is what an actual support call sounds like.
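To make the SNR figure in the noise preset concrete, here is a rough sketch of mixing noise into speech at a target SNR. This is the textbook gain formula, not noisekit's actual code; mix_at_snr and power are hypothetical names, and plain Python lists stand in for the audio arrays a real pipeline would use.

```python
import math


def power(samples: list[float]) -> float:
    """Mean squared amplitude of a signal."""
    return sum(s * s for s in samples) / len(samples)


def mix_at_snr(speech: list[float], noise: list[float], snr_db: float) -> list[float]:
    """Add noise to speech, scaled so that speech power / noise power = 10^(snr_db / 10)."""
    if not speech or not noise:
        raise ValueError("speech and noise must be non-empty")
    # Loop the noise clip so it covers the whole utterance, then trim it.
    repeats = -(-len(speech) // len(noise))  # ceiling division
    noise = (noise * repeats)[: len(speech)]
    noise_power = power(noise)
    if noise_power == 0:
        raise ValueError("noise clip is silent")
    gain = math.sqrt(power(speech) / (noise_power * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]
```

Lower SNR means louder noise: at 5 dB the speech has only about 3x the noise power, at 15 dB about 32x.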
Each output gets PESQ, SNR and NISQA scores in metadata.jsonl alongside the original transcript, so you can correlate WER with measured signal quality after the fact.
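A sketch of that after-the-fact correlation, under some loud assumptions: the file_name column follows the HuggingFace AudioFolder convention, but the metric key ("snr" here) is a guess at the schema, so check the actual field names in your metadata.jsonl. The per-utterance WER dict is something you build yourself from each vendor's output.

```python
import json
import math
from typing import Iterable


def load_metric(lines: Iterable[str], key: str) -> dict[str, float]:
    """Map file_name -> metric value from JSONL lines (e.g. an open metadata.jsonl)."""
    values = {}
    for line in lines:
        if not line.strip():
            continue
        row = json.loads(line)
        values[row["file_name"]] = float(row[key])  # `key` is a hypothetical field name
    return values


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def metric_vs_wer(metric: dict[str, float], wer_by_file: dict[str, float]) -> float:
    """Correlate a quality metric with per-utterance WER over the files both dicts share."""
    shared = sorted(set(metric) & set(wer_by_file))
    return pearson([metric[f] for f in shared], [wer_by_file[f] for f in shared])
```

A strongly negative value for SNR vs. WER would mean the vendor degrades steadily as noise rises, while a value near zero suggests its errors are driven by something other than the measured noise level.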
Repo: https://github.com/karamouche/noisekit (MIT, uvx-runnable so zero install)
Genuinely curious to hear from people who've benchmarked STT in production: what degradation conditions am I missing?