r/MachineLearning · May 22, 2026 · 3 min read

Live Human Detector on Outbound Phone Calls [R]

Mirrored from r/MachineLearning for archival readability. Support the source by reading on the original site.

Goal
To save people from wasting time sitting in call centre queues waiting to be answered.

To have a tool listen in on the audio stream of a live call, after IVR navigation, and determine whether the call has transitioned out of the queue to a live person.

Requirements

The tool must be able to classify the audio within a contextual window of under 1-2 seconds, with as high a confidence level as possible.

This is not a typical AMD (answering machine detection) tool; we are not just detecting machine audio versus human speech.
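As a rough illustration of what a 1-2 second contextual window means at telephony sample rates, here is a minimal sketch that cuts a mono stream into overlapping windows. The function name `sliding_windows`, the 1.5 s window and the 0.25 s hop are assumptions chosen for illustration, not values from the post.

```python
import numpy as np

SAMPLE_RATE = 8000  # Hz, the telephony rate mentioned in the post

def sliding_windows(samples, window_s=1.5, hop_s=0.25, sample_rate=SAMPLE_RATE):
    """Yield (start_time_s, window) pairs over a 1-D array of samples.

    window_s and hop_s are illustrative defaults, not tuned values.
    """
    window = int(window_s * sample_rate)
    hop = int(hop_s * sample_rate)
    for start in range(0, len(samples) - window + 1, hop):
        yield start / sample_rate, samples[start:start + window]

# Example: 3 seconds of silence stands in for a live audio buffer.
audio = np.zeros(3 * SAMPLE_RATE, dtype=np.float32)
windows = list(sliding_windows(audio))
```

A shorter hop gives the classifier more chances to catch the transition quickly, at the cost of more inference calls per second.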

Assumed Challenges

It may be difficult to distinguish a pre-recorded RVA (Recorded Voice Announcement) from a human speaking. RVAs are typically professionally recorded with distinct pitch and emotional cues, and have clean audio with no background noise or silence before and after the message. This is not always the case, especially if announcements are recorded in-house by general staff.
When a call transitions and is 'Answered', there is usually a distinct soft click and/or some background noise before the agent starts speaking. This silence period, whilst a good indication that a call has been answered, could be confused with quiet periods between music or RVA announcements in the queue.
It may be difficult to determine whether we have been answered by Voicemail. Whilst there is usually a beep at the end, the message itself would also start with a silence period followed by audio that sounds similar to an RVA.
A single short beep tone could mean Voicemail or Answered, or it could mean the call is being recorded.
Identifying that we are in a queue based on TTS audio may become harder as TTS engines become more sophisticated.
Telephony audio (G.711 A-law) occupies the 300–3400 Hz frequency band, sampled at 8000 Hz and encoded at 64 kbit/s.
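The G.711 figures in the last point fit together arithmetically: 8000 samples per second at 8 bits per sample gives 64 kbit/s. The sketch below works that through and shows how many bytes of encoded audio a 1-2 second window holds. The constant and function names are illustrative.

```python
SAMPLE_RATE_HZ = 8000   # G.711 sampling rate
BITS_PER_SAMPLE = 8     # G.711 A-law stores each sample as one companded byte

# Bitrate of the encoded stream: 8000 * 8 = 64000 bit/s.
bitrate_bps = SAMPLE_RATE_HZ * BITS_PER_SAMPLE

# Highest frequency representable at this sample rate (Nyquist limit).
nyquist_hz = SAMPLE_RATE_HZ / 2

def bytes_for_window(seconds):
    """Number of G.711 bytes covering a window of the given length."""
    return int(seconds * SAMPLE_RATE_HZ * BITS_PER_SAMPLE / 8)

one_second = bytes_for_window(1.0)
two_seconds = bytes_for_window(2.0)
```

Anything above 4000 Hz cannot be represented in the stream, so a model pretrained on wideband audio would see a narrower spectrum here than it was trained on.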

Approach

To train, via machine learning on labelled data, an audio classification application that analyses the acoustics, waveform or spectrogram (via Fast Fourier Transform) of the audio stream.

At this stage I do not want to use STT to determine the phase or label, although this will likely be added later as an additional layer in the pipeline to increase confidence in some of these labels, such as RVA/TTS/Voicemail/Call Screening.
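For reference, a spectrogram via FFT can be computed with NumPy alone. The following is a minimal sketch, assuming 25 ms Hann-windowed frames with a 10 ms hop (common speech-processing defaults, not values from the post). The function name `log_spectrogram` is hypothetical.

```python
import numpy as np

def log_spectrogram(samples, sample_rate=8000, frame_ms=25, hop_ms=10):
    """Return a (n_frames, n_bins) log-power spectrogram in dB.

    Frame and hop lengths are assumed defaults, not tuned for this task.
    """
    frame = int(sample_rate * frame_ms / 1000)   # 200 samples at 8 kHz
    hop = int(sample_rate * hop_ms / 1000)       # 80 samples at 8 kHz
    window = np.hanning(frame)
    n_frames = 1 + (len(samples) - frame) // hop
    frames = np.stack([
        samples[i * hop:i * hop + frame] * window for i in range(n_frames)
    ])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return 10 * np.log10(power + 1e-10)  # small offset avoids log(0)

# Example: one second of a 1 kHz test tone at the telephony sample rate.
t = np.arange(8000) / 8000
tone = np.sin(2 * np.pi * 1000 * t)
spec = log_spectrogram(tone)
peak_bin = int(spec.mean(axis=0).argmax())
```

With 200-sample frames at 8 kHz each bin spans 40 Hz, so the 1 kHz tone peaks at bin 25. The resulting array can be fed to a classifier as an image-like input.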

Phases and Labels

Phase: Queuing
Labels: Music, TTS, RVA (Recorded Voice Announcement)

Phase: Transitioning
Labels: Ringback, Answered, Machine Beep

Phase: Connected
Labels: Human, Fax, Voicemail, Call Screening

Phase: Disconnected
Labels: Engaged Tone
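The phase and label hierarchy above can be written down as a small lookup structure, so that a classifier predicts a label and the phase is derived from it. This is only a sketch; the variable names are illustrative.

```python
# Phases and their labels, copied from the lists above.
PHASE_LABELS = {
    "Queuing": ["Music", "TTS", "RVA"],
    "Transitioning": ["Ringback", "Answered", "Machine Beep"],
    "Connected": ["Human", "Fax", "Voicemail", "Call Screening"],
    "Disconnected": ["Engaged Tone"],
}

# Reverse lookup: label -> phase.
LABEL_TO_PHASE = {
    label: phase
    for phase, labels in PHASE_LABELS.items()
    for label in labels
}

# Stable integer ids for use as classifier targets.
LABELS = sorted(LABEL_TO_PHASE)
LABEL_TO_INDEX = {label: i for i, label in enumerate(LABELS)}

phase_of_human = LABEL_TO_PHASE["Human"]
```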

References

https://www.mdpi.com/2076-3417/12/7/3293 - YOHO (You Only Hear Once)
https://www.vicidial.org/VICIDIALforum/viewtopic.php?t=42330

https://huggingface.co/learn/audio-course/chapter2/audio_classification_pipeline

https://www.youtube.com/watch?v=m3XbqfIij_Y&t=32s

https://google-ai-edge.github.io/mediapipe-samples-web/#/audio/audio_classifier

https://scikit-learn.org/stable/machine_learning_map.html

https://arxiv.org/pdf/2410.08235

Question

Seeking assistance on where to actually start. Yes, I will be relying heavily on Claude Code to build this, so apologies in advance.

What is the best framework / algorithm / approach to start solving this problem? I have seen existing frameworks like YAMNet work well and fast on classifying audio, but others suggest Whisper and ASR.

What is the best way of tagging or labelling data? Do I label existing full-length recordings with stop/start timestamps for each label, or do I need to split each label into its own file, resulting in a loss of context?

Are there obvious existing data sets I should be using for some of my labels?
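To make the timestamp option concrete, here is a minimal sketch of one way to turn stop/start annotations on a full-length recording into per-window training labels, so the original file stays intact. It is an illustration, not a recommendation. The function `label_windows`, the majority-overlap rule and the example segment times are all assumptions.

```python
def label_windows(segments, duration_s, window_s=1.0, hop_s=0.5):
    """Assign each window the label it overlaps most.

    segments: list of (start_s, end_s, label) annotations.
    Returns a list of (window_start_s, window_end_s, label).
    """
    n_windows = int((duration_s - window_s) / hop_s + 1e-9) + 1
    out = []
    for i in range(n_windows):
        start = i * hop_s
        end = start + window_s
        best_label, best_overlap = None, 0.0
        for seg_start, seg_end, label in segments:
            overlap = min(end, seg_end) - max(start, seg_start)
            if overlap > best_overlap:
                best_label, best_overlap = label, overlap
        out.append((start, end, best_label))
    return out

# Hypothetical annotations for a 4-second excerpt of one call.
segments = [
    (0.0, 2.0, "Music"),
    (2.0, 2.4, "Answered"),
    (2.4, 4.0, "Human"),
]
labelled = label_windows(segments, duration_s=4.0)
window_labels = [label for _, _, label in labelled]
```

In this example the short "Answered" segment never wins a window under the majority rule, which shows that brief transition events may need a shorter window or a different assignment rule.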

submitted by /u/Bucky102