As the name suggests, we will use a Deep Bidirectional Recurrent Neural Network with LSTMs (DBRNN) to approach the state-of-the-art performance described by Graves et al., training on a dataset of typical speech (no speech impediments). The model will use Mel Frequency Cepstral Coefficients (MFCCs) for filtering and feature extraction. We will also use Connectionist Temporal Classification (CTC) as the cost function; CTC handles alignment and makes it possible to label unsegmented sequences. A word-to-ARPAbet phoneme dictionary from CMU is used as well.
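To illustrate how CTC relates frame-level network outputs to an unsegmented label sequence, here is a minimal sketch of the collapsing step: consecutive repeated labels are merged, then blank symbols are dropped. The blank symbol `_`, the function name `ctc_collapse`, and the example frames are assumptions made for illustration, and ARPAbet stress markers are omitted for brevity; this is not the project's actual decoding code.

```python
from itertools import groupby

BLANK = "_"  # hypothetical blank symbol; the real index depends on the model


def ctc_collapse(frame_labels, blank=BLANK):
    """Map per-frame labels to a label sequence: merge repeats, then drop blanks."""
    # groupby yields one key per run of identical consecutive labels
    merged = [label for label, _ in groupby(frame_labels)]
    return [label for label in merged if label != blank]


# Illustrative per-frame outputs for an utterance of "hello"
frames = ["_", "HH", "HH", "_", "AH", "AH", "AH", "L", "_", "OW", "OW", "_"]
phonemes = ctc_collapse(frames)
print(phonemes)
```

A blank between two identical labels keeps them distinct, so `["AA", "AA", "_", "AA"]` collapses to two `AA` labels rather than one.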
Output phonemes are then post-processed by altering the phoneme sequence to generate candidate words. Those words are then fed to another recurrent neural network that acts as a language model, assigning each word a probability given the previous word(s). Beam search will be used to scan the possible sentences efficiently.
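The following is a rough sketch of how beam search could scan candidate sentences under a language model. It assumes the post-processing step yields a list of candidate words per position, and it uses a toy bigram table (`BIGRAM`, `toy_lm`) as a stand-in for the recurrent language model; all names and probabilities here are hypothetical.

```python
import math


def beam_search(candidates, score_fn, beam_width=3):
    """Keep the beam_width best partial sentences at each word position.

    candidates: list of lists, the candidate words for each position.
    score_fn(prev_words, word): log-probability of word given the words so far.
    """
    beams = [((), 0.0)]  # (words so far, cumulative log-probability)
    for options in candidates:
        expanded = [
            (words + (word,), logp + score_fn(words, word))
            for words, logp in beams
            for word in options
        ]
        expanded.sort(key=lambda beam: beam[1], reverse=True)
        beams = expanded[:beam_width]
    return beams


# Toy bigram probabilities standing in for the RNN language model (made up)
BIGRAM = {
    ("<s>", "I"): 0.6, ("<s>", "eye"): 0.1,
    ("I", "see"): 0.5, ("I", "sea"): 0.05,
    ("eye", "see"): 0.1, ("eye", "sea"): 0.1,
}


def toy_lm(prev_words, word):
    prev = prev_words[-1] if prev_words else "<s>"
    return math.log(BIGRAM.get((prev, word), 1e-4))  # small floor for unseen pairs


candidates = [["I", "eye"], ["see", "sea"]]
best_words, best_logp = beam_search(candidates, toy_lm, beam_width=2)[0]
print(" ".join(best_words))
```

Working in log-probabilities keeps the scores additive and avoids underflow on long sentences; the beam width trades search cost against the risk of pruning the best sentence early.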
This project is likely to include a speaker-dependent version to increase the accuracy of automatic speech recognition. TensorFlow is the framework used in this project. A GUI will be designed with Qt Designer for ease of use and demonstrations.