Improving Speech Transcription Accuracy Using Computer Vision & Lip Reading with Intel RealSense

Existing machine learning speech transcription approaches work mainly in the audio domain, where the state of the art approaches the accuracy of human transcription under ideal conditions. In noisy environments, however, and especially when many simultaneous conversations are in progress (the 'cocktail party problem'), these approaches yield much poorer or even useless results. We will explore an approach that augments and improves audio transcription, especially in noisy environments, by lip reading a live video feed from an Intel RealSense camera using a machine vision model built with OpenCV and the Intel OpenVINO toolkit. This project supports our entry into the IBM Watson AI XPRIZE competition.

Project status: Under Development

Robotics, RealSense™, Internet of Things, Game Development, Artificial Intelligence

Intel Technologies
AI DevCloud / Xeon, Movidius NCS, OpenVINO


Overview / Usage

Existing machine learning speech transcription approaches work mainly in the audio domain, where the state of the art approaches the accuracy of human transcription under ideal conditions. In noisy environments, however, and especially when many simultaneous conversations are in progress (the 'cocktail party problem'), these approaches yield much poorer or even useless results.

We will explore an approach that augments and improves the audio transcription results, especially in noisy environments, by lip reading a live video feed from an Intel RealSense camera using a machine vision model built with OpenCV.
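
As a concrete starting point, the sketch below shows the kind of OpenCV front end we have in mind: a stock Haar cascade localizes the face in each frame, and a mouth region of interest (ROI) is cropped for a downstream lip-reading network. The webcam capture and the lower-third mouth heuristic are placeholders standing in for the RealSense color stream and a learned landmark detector, not our final pipeline.

```python
# Minimal sketch: crop a mouth ROI from a live feed with OpenCV's stock
# Haar face detector. The lip-reading model itself is a separate deep
# network; this only illustrates the assumed ROI-extraction front end.
import cv2

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

cap = cv2.VideoCapture(0)  # placeholder for the RealSense color stream
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    for (x, y, w, h) in face_cascade.detectMultiScale(gray, 1.3, 5):
        # Heuristic: the mouth occupies roughly the lower third of the face box.
        mouth_roi = frame[y + 2 * h // 3 : y + h, x : x + w]
        cv2.imshow("mouth ROI", mouth_roi)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()
```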

As far as we know, existing lip reading computer vision algorithms use only conventional RGB video; in this project we also plan to explore whether augmenting the video with 3D data from the Intel RealSense sensor can further improve lip reading results. Our particular application may also benefit from accurately detecting only a very small vocabulary of command words using 3D vision. We will investigate whether detecting just these words in real time can yield high accuracy with a minimum of processing, compared to general speech transcription; see the sketch below for one way the depth data could be brought in.
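
One way depth could enter the pipeline is via the pyrealsense2 SDK, which can align the D415 depth stream to the color stream so that an RGB patch and a depth patch share pixel coordinates. The snippet below is a hedged sketch of that alignment step only; the crop coordinates are placeholders standing in for a detected mouth ROI, and how the fused RGB-D patch feeds the model remains an open design question in this project.

```python
# Sketch: align the RealSense D415 depth stream to the color stream so a
# depth patch can be cropped at the same pixel coordinates as the mouth ROI.
import numpy as np
import pyrealsense2 as rs

pipeline = rs.pipeline()
config = rs.config()
config.enable_stream(rs.stream.depth, 640, 480, rs.format.z16, 30)
config.enable_stream(rs.stream.color, 640, 480, rs.format.bgr8, 30)
pipeline.start(config)
align = rs.align(rs.stream.color)  # map depth pixels onto the color image

try:
    frames = align.process(pipeline.wait_for_frames())
    color = np.asanyarray(frames.get_color_frame().get_data())  # HxWx3 uint8
    depth = np.asanyarray(frames.get_depth_frame().get_data())  # HxW uint16 (raw depth units)
    # Placeholder coordinates standing in for a detected mouth ROI;
    # scaling/normalization of the depth channel is omitted here.
    x, y, w, h = 280, 300, 80, 40
    rgbd_patch = np.dstack([color[y:y+h, x:x+w], depth[y:y+h, x:x+w, None]])
finally:
    pipeline.stop()
```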

Methodology / Approach

Using state-of-the-art machine learning frameworks and leveraging OpenVINO for hardware optimization, we will train, test, and deploy for inference a deep learning model for audio-visual speech recognition (AVSR), incorporating audio and video as well as depth data.
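
For the deployment side, a plausible path is to convert the trained network to OpenVINO IR and run it with the OpenVINO Runtime, targeting the NCS2 through the MYRIAD device plugin (or the NUC's CPU directly). The sketch below assumes hypothetical IR file names (avsr.xml/avsr.bin) and a placeholder input shape; both depend on the final model architecture.

```python
# Sketch of the intended deployment path with the OpenVINO Runtime.
# "avsr.xml" is a hypothetical IR file produced from the trained AVSR model;
# device "MYRIAD" targets the NCS2 (supported in OpenVINO releases up to
# 2022.x), while "CPU" would target the NUC directly.
import numpy as np
from openvino.runtime import Core

core = Core()
model = core.read_model("avsr.xml")
compiled = core.compile_model(model, "MYRIAD")

# Dummy input standing in for a preprocessed clip of mouth-ROI frames;
# the real input shape(s) depend on the network we settle on.
clip = np.zeros((1, 3, 16, 96, 96), dtype=np.float32)
logits = compiled([clip])[compiled.output(0)]
print("predicted command word id:", int(np.argmax(logits)))
```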

Technologies Used

Intel OpenVINO

Intel RealSense D415

Intel NUC

Intel NCS2
