r/MachineLearning · May 27, 2026 · 2 min read

noisekit - CLI for generating realistic degraded speech datasets for ASR benchmarking [P]

#voice #agents #benchmark #developer-tool #music


If you've ever tried to pick an STT vendor for a phone-based voice agent or call center product, you've probably hit this wall: you have plenty of real production audio, but it's unlabeled, so you can't compute WER on it. And the annotated public datasets (FLEURS, Common Voice, LibriSpeech) are relatively clean read-speech recordings that tell you little about how STT models actually handle your noisy, G.711-encoded phone calls.

Annotating production audio is slow, expensive, and usually a privacy headache. So most teams end up benchmarking on clean data, picking a vendor, then discovering in prod which one actually survives noise.

noisekit fills that gap. Take a clean annotated dataset, apply degradations that approximate your production conditions, and you end up with a noisy annotated corpus you can run WER on across every STT candidate.

    uvx noisekit generate \
      --dataset google/fleurs --config en_us --split test \
      --samples 100 \
      --output ./noisy-fleurs

Feed ./noisy-fleurs through each STT candidate, normalize, and compute WER with the existing transcripts. The output is HuggingFace AudioFolder-compatible, so load_dataset("audiofolder", data_dir="./noisy-fleurs") works.
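For the "normalize, and compute WER" step, here is a minimal stdlib-only sketch. It is not part of noisekit: `normalize` and `wer` are illustrative names, and the normalization shown (lowercasing, stripping ASCII punctuation) is an assumption. Real benchmarks often need a heavier normalizer for numbers, contractions, and so on.

```python
import string


def normalize(text: str) -> list[str]:
    """Lowercase, strip ASCII punctuation, and split on whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference transcript is empty after normalization")
    # prev[j] is the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)
```

Call `wer(transcript, stt_output)` per sample with the same normalizer for every vendor, otherwise the differences you measure are partly formatting differences.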

Presets cover the conditions that actually matter for voice products:

telecom: G.711 narrowband bandpass + 8-bit BitCrush + 16-32 kbps MP3 (sounds like a real phone call, not a synthetic low-pass filter)
noise: real ambient mixed at 5-15 dB SNR (auto-downloads a MUSAN noise-only subset, or bring your own --noise-dir matching your domain: call center, cafe, car, street)
reverb: pyroomacoustics far-field at 1-3 m mic distance
low_bitrate: wideband MP3 at 16-32 kbps
clipping: ADC / mic saturation
clean_reference: control / WER floor
Compound chains stack realistically: noise_telecom applies the noisy room first, then the phone codec, which is what an actual support call sounds like.
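To make two of the presets concrete, here is a rough sketch of mixing noise at a target SNR and then hard-clipping the result. This is an illustration of the general technique, not noisekit's actual implementation: the function names, the looping of short noise clips, and the 0.5 clip threshold are all assumptions, and it uses plain Python lists of float samples to stay dependency-free.

```python
import math


def rms(samples):
    """Root-mean-square level of a list of float samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def mix_at_snr(speech, noise, snr_db):
    """Scale noise so the speech-to-noise ratio equals snr_db, then add it."""
    if len(noise) < len(speech):
        # Loop the noise clip until it covers the whole utterance.
        noise = noise * (len(speech) // len(noise) + 1)
    noise = noise[:len(speech)]
    speech_rms, noise_rms = rms(speech), rms(noise)
    if speech_rms == 0 or noise_rms == 0:
        raise ValueError("speech and noise must both be non-silent")
    # SNR in dB is 20 * log10(speech_rms / noise_rms), solved for the noise gain.
    gain = speech_rms / (noise_rms * 10 ** (snr_db / 20))
    return [s + gain * n for s, n in zip(speech, noise)]


def hard_clip(samples, threshold=0.5):
    """Simulate ADC / mic saturation by clamping to [-threshold, threshold]."""
    return [max(-threshold, min(threshold, s)) for s in samples]


def noisy_then_clipped(speech, noise, snr_db, threshold=0.5):
    """Toy compound chain: room noise is added first, then the capture clips."""
    return hard_clip(mix_at_snr(speech, noise, snr_db), threshold)
```

The order in the chain matters: clipping after mixing distorts speech and noise together, which is different from adding clean noise on top of already-clipped speech.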

Each output gets PESQ, SNR and NISQA scores in metadata.jsonl alongside the original transcript, so you can correlate WER with measured signal quality after the fact.
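A sketch of that correlation step, assuming you have already computed a per-file WER yourself. The `file_name` key is the HuggingFace AudioFolder convention; the score key `"snr"` is a guess at how the field is named, so check your own metadata.jsonl. All function names here are illustrative.

```python
import json
import math


def load_scores(jsonl_lines, score_key):
    """Map file_name -> score from metadata.jsonl-style lines."""
    scores = {}
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        row = json.loads(line)
        scores[row["file_name"]] = float(row[score_key])
    return scores


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def correlate(jsonl_lines, wer_by_file, score_key="snr"):
    """Correlate a quality score with WER over files present in both."""
    scores = load_scores(jsonl_lines, score_key)
    shared = sorted(set(scores) & set(wer_by_file))
    return pearson([scores[f] for f in shared],
                   [wer_by_file[f] for f in shared])
```

Pass an open file handle for metadata.jsonl as `jsonl_lines` and a dict of your own per-file WER results as `wer_by_file`. A strongly negative value for SNR would mean WER climbs as the signal gets worse, which is what you'd expect.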

Repo: https://github.com/karamouche/noisekit (MIT, uvx-runnable so zero install)

Genuinely curious to hear from people who've benchmarked STT in production: what degradation conditions am I missing?

submitted by /u/Karamouche