New Google AI system identifies individual voices in crowds

By Le Williams | 3 years ago

Google researchers have developed a deep learning AI system that can separate the voices of people in a conversation by looking at their faces, and then enhance the quality of each voice.

The researchers first trained a neural network to recognize the voices of individual people speaking alone. After that training, the system simulated virtual parties by combining the individual voices, teaching the AI to isolate multiple voices into separate audio tracks.
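The idea of building training data from simulated parties can be illustrated with a short sketch. It is not Google's code: the function names `mix_clean_tracks` and `make_training_example` are hypothetical, the sine waves merely stand in for recordings of people speaking alone, and simple summation is an assumed way of combining voices. The point is only that when a mixture is built from known clean tracks, those tracks are available as the answers the network should learn to recover.

```python
import numpy as np


def mix_clean_tracks(tracks):
    """Sum equal-length clean tracks into a single 'virtual party' mixture."""
    stacked = np.asarray(tracks, dtype=np.float64)
    if stacked.ndim != 2:
        raise ValueError("expected a list of equal-length 1-D tracks")
    return stacked.sum(axis=0)


def make_training_example(tracks):
    """Pair a synthetic mixture (input) with the clean tracks (targets)."""
    targets = np.asarray(tracks, dtype=np.float64)
    return {"input": mix_clean_tracks(targets), "targets": targets}


# Stand-in signals: two pure tones instead of real speech (assumption).
rate = 16000                      # samples per second
t = np.arange(rate) / rate        # one second of audio
speaker_a = np.sin(2 * np.pi * 220 * t)
speaker_b = np.sin(2 * np.pi * 330 * t)

example = make_training_example([speaker_a, speaker_b])
print(example["input"].shape, example["targets"].shape)
```

A separation model would receive `example["input"]` and be scored on how closely its output tracks match `example["targets"]`.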

A test clip featured on YouTube shows the system separating the voices of two comedians from Team Coco, identifying their faces and generating an audio track of each individual’s speech. The video also demonstrates, step by step, how one particular voice can be heard more distinctly by fading out the audience laughter.

According to Google, the method combines the auditory and visual signals of an input video to separate the speech. The system observes the movements of a person’s mouth and associates them with the sounds produced as that person speaks. Combining the visual cues with the audio helps the system separate the voices and create a clean speech track for a specific visible speaker in the video.

Google is exploring opportunities to test the feature in products such as Hangouts and Duo, where it could make the sender’s voice clearer when that person is speaking within a crowd.

“We believe this capability can have a wide range of applications, from speech enhancement and recognition in videos, through video conferencing, to improved hearing aids, especially in situations where there are multiple people speaking,” said the Google Research Blog.

Google also sees the technology as a benefit for automatic closed captioning systems when multiple speakers overlap, where it could serve as an efficient pre-processing step for speech recognition.