Live Human Detector on Outbound Phone Calls [R]
Mirrored from r/MachineLearning for archival readability. Support the source by reading on the original site.
Goal
To save humans from wasting time sitting in call centre queues waiting to be answered.
To have a tool listen in on the audio stream of a live call, after IVR navigation, to determine whether the call has transitioned out of the queue and to a live person.
Requirements
The tool must be able to classify the audio within a contextual window of 1-2 seconds or less, with as high a confidence level as possible.
This is not a typical AMD (answering machine detection) tool; we are not just detecting machine audio vs human speech.
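One way to picture the 1-2 second requirement is a rolling buffer over the incoming audio frames. The sketch below is illustrative only. It assumes 8 kHz mono samples arriving in 20 ms frames, a 1 s window and a 200 ms hop, and `classify` is a hypothetical placeholder for whatever model ends up being trained.

```python
from collections import deque

SAMPLE_RATE = 8000      # assumption: narrowband telephony audio
FRAME_SAMPLES = 160     # assumption: 20 ms frames (160 samples at 8 kHz)
WINDOW_SAMPLES = 8000   # assumption: 1 s contextual window
HOP_SAMPLES = 1600      # assumption: re-classify every 200 ms


def classify(window):
    """Hypothetical placeholder for a trained model; returns (label, confidence)."""
    return ("Unknown", 0.0)


def stream_windows(frames, window=WINDOW_SAMPLES, hop=HOP_SAMPLES):
    """Yield the most recent `window` samples each time `hop` new samples arrive."""
    buf = deque(maxlen=window)  # oldest samples fall off the left automatically
    since_last = 0
    for frame in frames:
        buf.extend(frame)
        since_last += len(frame)
        # Only emit once the buffer holds a full window of context.
        if len(buf) == window and since_last >= hop:
            since_last = 0
            yield list(buf)


if __name__ == "__main__":
    # Two seconds of silence delivered as 20 ms frames.
    frames = [[0] * FRAME_SAMPLES for _ in range(100)]
    for win in stream_windows(frames):
        label, confidence = classify(win)
        print(len(win), label, confidence)
```

With these assumed sizes the first decision cannot arrive until a full second of audio has been buffered, so the window length directly sets the minimum detection latency.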
Assumed Challenges
- It may be difficult to distinguish between a pre-recorded RVA (Recorded Voice Announcement) and a human speaking. RVAs are typically professionally recorded with distinct pitch and emotional cues, and have clean audio with no background noise or silence before and after the message. This is not always the case, especially if announcements are recorded in-house by general staff.
- When a call is transitioning and is 'Answered', there is usually a distinct soft click and/or some background noise before the agent starts speaking. This silence period, whilst a good indication that a call has been answered, could be confused with quiet periods between music or RVAs in the queue.
- It may be difficult to determine if we have been answered by Voicemail. Whilst there is usually a beep at the end, the message itself would also start with a silence period followed by audio sounding similar to an RVA.
- A single short beep tone could mean Voicemail or Answered, or it could mean the call is being recorded.
- Identifying that we are in a queue based on TTS audio may become harder as TTS engines become more sophisticated.
- Telephony audio (G.711 A-law) covers the 300–3400 Hz frequency band, sampled at 8000 Hz, at 64 kbit/s.
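To make the numbers in the last bullet concrete: G.711 carries one 8-bit sample every 1/8000 s, so a 1 s window is 8000 bytes and a 2 s window is 16000 bytes. The sketch below expands A-law bytes to linear PCM using the commonly published reference decoding steps. It assumes the stream is raw A-law with no container, and the function names are illustrative, not from any particular library.

```python
SAMPLE_RATE_HZ = 8000
BITS_PER_SAMPLE = 8
BIT_RATE = SAMPLE_RATE_HZ * BITS_PER_SAMPLE  # 64000 bit/s


def alaw_byte_to_pcm(a_val: int) -> int:
    """Expand one G.711 A-law byte to a signed linear sample (16-bit scale)."""
    a_val ^= 0x55                    # undo the even-bit inversion applied on the wire
    t = (a_val & 0x0F) << 4          # 4-bit mantissa
    seg = (a_val & 0x70) >> 4        # 3-bit segment (exponent)
    if seg == 0:
        t += 8
    elif seg == 1:
        t += 0x108
    else:
        t += 0x108
        t <<= seg - 1
    return t if (a_val & 0x80) else -t   # sign bit set means positive


def decode_alaw(payload: bytes) -> list[int]:
    """Decode a raw A-law payload (assumed headerless) into linear samples."""
    return [alaw_byte_to_pcm(b) for b in payload]


def bytes_for_window(seconds: float) -> int:
    """Number of A-law bytes covering `seconds` of audio (one byte per sample)."""
    return int(seconds * SAMPLE_RATE_HZ)


if __name__ == "__main__":
    print(BIT_RATE, bytes_for_window(1), bytes_for_window(2))
    print(decode_alaw(bytes([0xD5, 0x55])))  # the two smallest-magnitude codes
```

Because the codec already discards everything above roughly 3.4 kHz, any model or feature extractor that assumes wideband (for example 16 kHz) input would need resampling first, which is worth keeping in mind when choosing a pretrained model.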
Approach
To train, via machine learning on labelled data, an audio classification application that analyses the acoustics, waveform or spectrogram (via Fast Fourier Transform) of the audio stream.
At this stage I do not want to use STT to determine the phase or label, although this will likely be added at a later stage as an additional layer in the pipeline to increase confidence in some of these labels, such as RVA/TTS/Voicemail/Call Screening.
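As a rough illustration of the spectrogram route, the sketch below turns a block of samples into log-power frames with a small radix-2 FFT written in plain Python. The frame length (256 samples, 32 ms at 8 kHz), the hop (128 samples) and the Hann window are assumptions chosen for the example. In practice a library routine such as NumPy's `numpy.fft.rfft` would normally replace the hand-written `fft`.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(-2j * math.pi * k / n) * odd[k]  # twiddle factor times odd part
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out


def log_spectrogram(samples, frame_len=256, hop=128):
    """Return a list of frames, each holding log10 power for bins 0..frame_len/2."""
    hann = [0.5 - 0.5 * math.cos(2 * math.pi * i / (frame_len - 1))
            for i in range(frame_len)]
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        windowed = [s * w for s, w in zip(samples[start:start + frame_len], hann)]
        spectrum = fft(windowed)[: frame_len // 2 + 1]  # keep non-negative frequencies
        frames.append([math.log10(abs(c) ** 2 + 1e-10) for c in spectrum])
    return frames


if __name__ == "__main__":
    sr = 8000
    # One second of a 1 kHz test tone, roughly like a beep.
    tone = [math.sin(2 * math.pi * 1000 * n / sr) for n in range(sr)]
    spec = log_spectrogram(tone)
    peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
    print(len(spec), len(spec[0]), peak_bin * sr / 256)
```

Each bin here spans 8000 / 256 = 31.25 Hz, so a steady tone shows up as a single bright row while speech and music spread across many bins. That difference is the kind of structure a classifier trained on these frames could pick up.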
Phases and labels
- Queuing: Music, TTS, RVA (Recorded Voice Announcement)
- Transitioning: Ringback, Answered, Machine Beep
- Connected: Human, Fax, Voicemail, Call Screening
- Disconnected: Engaged Tone
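The phase and label taxonomy above could be held in code as a simple mapping, so that a predicted label can always be traced back to its call phase. This is only a sketch: the structure and helper names are assumptions, and only the phase and label names come from the lists above.

```python
# Phase -> labels, copied from the taxonomy in the post.
PHASE_LABELS = {
    "Queuing": ["Music", "TTS", "RVA"],
    "Transitioning": ["Ringback", "Answered", "Machine Beep"],
    "Connected": ["Human", "Fax", "Voicemail", "Call Screening"],
    "Disconnected": ["Engaged Tone"],
}

# Reverse lookup: label -> phase.
LABEL_TO_PHASE = {
    label: phase
    for phase, labels in PHASE_LABELS.items()
    for label in labels
}

# Stable integer ids for training targets (order follows the dict above).
LABEL_TO_INDEX = {label: i for i, label in enumerate(LABEL_TO_PHASE)}


def phase_of(label: str) -> str:
    """Return the call phase a label belongs to."""
    return LABEL_TO_PHASE[label]


def is_live_person(label: str) -> bool:
    """Hypothetical helper: only the 'Human' label counts as a live person."""
    return label == "Human"


if __name__ == "__main__":
    for label, idx in LABEL_TO_INDEX.items():
        print(idx, label, phase_of(label))
```

Keeping the phase as a separate level makes it possible to score a model on the coarse phase even when it confuses two labels inside the same phase, for example TTS vs RVA.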
References
https://www.mdpi.com/2076-3417/12/7/3293 - YOHO (You Only Hear Once)
https://www.vicidial.org/VICIDIALforum/viewtopic.php?t=42330
https://huggingface.co/learn/audio-course/chapter2/audio_classification_pipeline
https://www.youtube.com/watch?v=m3XbqfIij_Y&t=32s
https://google-ai-edge.github.io/mediapipe-samples-web/#/audio/audio_classifier
https://scikit-learn.org/stable/machine_learning_map.html
https://arxiv.org/pdf/2410.08235
Question
Seeking assistance on where to actually start. Yes, I will be relying heavily on Claude Code to build this, so apologies in advance.
What is the best framework / algorithm / approach to start solving this problem? I have seen existing frameworks like YAMNet work well and fast on classifying audio; however, others suggest Whisper and ASR.
What is the best way of tagging or labelling data? Do I label existing full-length recordings with stop/start timestamps for each label, or do I need to split each label into its own file, resulting in a loss of context?
Are there obvious existing data sets I should be using for some of my labels?
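For what it is worth, the two labelling options are not mutually exclusive: timestamped annotations on full-length recordings can be cut into fixed-length training windows afterwards. The sketch below shows one possible way to do that, giving each window the label that covers most of it. The `Segment` structure, the 1 s window, 0.5 s hop and 50% overlap threshold are all assumptions for illustration, not a recommendation from any framework.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds from the start of the recording
    end: float
    label: str


def window_labels(segments, duration, window=1.0, hop=0.5, min_overlap=0.5):
    """Assign each fixed-length window the label that overlaps it the most.

    Windows where no label covers at least `min_overlap` of the window are
    skipped. On an exact tie the earlier segment in `segments` wins.
    """
    out = []
    n = 0
    while n * hop + window <= duration + 1e-9:
        w_start = n * hop
        w_end = w_start + window
        best_label, best_overlap = None, 0.0
        for seg in segments:
            overlap = min(w_end, seg.end) - max(w_start, seg.start)
            if overlap > best_overlap:
                best_label, best_overlap = seg.label, overlap
        if best_label is not None and best_overlap >= min_overlap * window:
            out.append((w_start, w_end, best_label))
        n += 1
    return out


if __name__ == "__main__":
    # Hypothetical annotation of a 6 s recording.
    annotation = [
        Segment(0.0, 3.0, "Music"),
        Segment(3.0, 3.5, "Answered"),
        Segment(3.5, 6.0, "Human"),
    ]
    for w in window_labels(annotation, duration=6.0):
        print(w)
```

Note that a short event such as the 0.5 s "Answered" segment can be swamped by its neighbours under majority labelling, so short events may need a smaller window or event-style (onset/offset) targets instead.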