Specs2Text


This is a reimplementation of the Jasper speech-to-text model using the functional Keras API within TensorFlow. The initial goal of this project is to reimplement the best-performing version of the Jasper network and get it running on an Intel NUC.

Project status: Under Development

Artificial Intelligence

Groups
Student Developers for AI

Intel Technologies
Intel NUC, Intel Opt ML/DL Framework, Intel Python, MKL, DevCloud

Overview / Usage

The initial goal of this project is to reimplement variations of the Jasper network as described in this paper using the functional Keras API within TensorFlow and get them running on an Intel NUC. This network has been previously implemented in the OpenSeq2Seq Python package, in the low-level TensorFlow API, and in the PyTorch API. The initial Keras reimplementation serves to address issues of replication within machine learning and neural network research.
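As a rough illustration of the planned Keras port, the sketch below shows how one Jasper block (repeated convolutional sub-blocks plus a residual connection) might be expressed in the functional API. The filter count, kernel size, dropout rate, and repeat count are placeholders, not the values from the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers

def jasper_sub_block(x, filters, kernel_size, dropout, activate=True):
    """One sub-block: 1D convolution -> batch norm -> (ReLU -> dropout)."""
    x = layers.Conv1D(filters, kernel_size, padding="same")(x)
    x = layers.BatchNormalization()(x)
    if activate:
        x = layers.Activation("relu")(x)
        x = layers.Dropout(dropout)(x)
    return x

def jasper_block(inputs, filters, kernel_size, dropout, repeat=5):
    """A Jasper block: `repeat` sub-blocks plus a residual connection
    from the block input, added before the final activation."""
    x = inputs
    for _ in range(repeat - 1):
        x = jasper_sub_block(x, filters, kernel_size, dropout)
    x = jasper_sub_block(x, filters, kernel_size, dropout, activate=False)
    # Project the block input with a 1x1 conv so channel counts match.
    res = layers.Conv1D(filters, 1, padding="same")(inputs)
    res = layers.BatchNormalization()(res)
    x = layers.Add()([x, res])
    x = layers.Activation("relu")(x)
    x = layers.Dropout(dropout)(x)
    return x

# Example: one block over 64-dimensional spectral frames (placeholder shapes).
inputs = tf.keras.Input(shape=(None, 64))
outputs = jasper_block(inputs, filters=256, kernel_size=11, dropout=0.2)
model = tf.keras.Model(inputs, outputs)
```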

Following the initial reimplementation, the best-performing network will be tested against variations in both the spectrogram input and the character-level output. One version of the network will pair spectral power derived from a wideband spectrogram with English character-level output. Another version will pair spectral power derived from a narrowband spectrogram with IPA character-level output. The final version will pair spectral power derived from a wideband spectrogram with IPA character-level output.

Methodology / Approach

For the Keras reimplementation, I plan to work from the documentation provided in the arXiv paper and the source code of the two previous implementations. The critical barrier to a successful reimplementation is the training involved, which can require substantial compute, especially considering that the largest model is 54 layers deep. To manage this, I will train the models on Intel DevCloud, then perform inference on an Intel NUC. I will begin by training the smallest reported model (19 layers) on the smallest dataset (100 hours of speech from LibriSpeech), then scale up training as each model is completed.
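Because training and inference happen on different machines, the hand-off can be as simple as saving the trained Keras model on DevCloud and reloading it on the NUC. A minimal sketch; the model definition, output size, and filename here are placeholders:

```python
import numpy as np
import tensorflow as tf

# Stand-in for a trained model; the real one would be the 19-layer Jasper.
model = tf.keras.Sequential([
    tf.keras.layers.Conv1D(32, 11, padding="same", activation="relu",
                           input_shape=(None, 64)),
    tf.keras.layers.Dense(29, activation="softmax"),  # placeholder vocab size
])

# On DevCloud: persist the trained model (illustrative filename).
model.save("jasper_19.h5")

# On the NUC: reload for inference only; no optimizer state is needed.
nuc_model = tf.keras.models.load_model("jasper_19.h5", compile=False)
batch = np.random.randn(1, 100, 64).astype(np.float32)  # placeholder input
logits = nuc_model.predict(batch)
```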

Following the initial reimplementation, I plan to test the effect of variations in the input and output on the training trajectories of the best-performing model, as outlined in the summary above. The following describes the changes to the input and output in turn.

The original spectral input provides a power spectrum at sequential time steps from a narrowband spectrogram. This type of spectrogram highlights the natural harmonics of the provided acoustic signal. While this type of spectrogram is used quite often in machine learning, it is rarely used in the field of speech science, where wideband spectrograms are primarily used when analysing acoustic speech signals. A wideband spectrogram is computed the same way as a narrowband spectrogram save for the window of the FFT computation: for a narrowband spectrogram, the analysis window is generally around 20 ms, whereas for a wideband spectrogram it is generally 5 ms or less, ideally less than the period of the fundamental frequency.

Wideband spectrograms highlight formant information in the acoustic signal; that is, they provide information on the resonances of the vocal tract. These resonances carry critical information about the shape of the vocal tract and the position of the speech articulators (e.g., tongue, lips, teeth), which in turn indicates what speech sound is occurring. The idea behind using this type of input is to provide the network with a higher-fidelity signal, which may improve recognition accuracy.
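The narrowband/wideband distinction comes down to the length of the analysis window, which is straightforward to express in code. A minimal sketch using scipy.signal.spectrogram on placeholder audio, assuming 16 kHz input:

```python
import numpy as np
from scipy import signal

def power_spectrogram(waveform, sample_rate, window_ms):
    """Power spectrogram with a configurable analysis window:
    ~20 ms resolves individual harmonics (narrowband), while
    ~5 ms resolves formant structure (wideband)."""
    nperseg = int(sample_rate * window_ms / 1000)
    freqs, times, spec = signal.spectrogram(
        waveform, fs=sample_rate, nperseg=nperseg,
        noverlap=nperseg // 2, mode="psd")
    return freqs, times, spec

rate = 16000
audio = np.random.randn(rate).astype(np.float32)  # 1 s of placeholder audio

narrow = power_spectrogram(audio, rate, window_ms=20)  # narrowband
wide = power_spectrogram(audio, rate, window_ms=5)     # wideband
```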

With respect to variation in the output, I plan to test the use of the International Phonetic Alphabet (IPA) as the character-level output for calculating connectionist temporal classification (CTC) loss. IPA provides characters for the productions of different speech sounds as well as different variations of those productions. This allows for a higher degree of specificity in the character-level output.
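A minimal sketch of how IPA labels might feed into CTC loss in Keras: a hypothetical, abbreviated symbol inventory mapped to integer IDs, plus a thin wrapper around tf.keras.backend.ctc_batch_cost. Note that ctc_batch_cost reserves the last output class for the CTC blank, so the network would need len(IPA_SYMBOLS) + 1 output classes.

```python
import tensorflow as tf

# Hypothetical, abbreviated IPA inventory; a real one would cover every
# symbol appearing in the phonetic transcriptions of the training data.
IPA_SYMBOLS = ["ɑ", "æ", "ə", "i", "u", "p", "b", "t", "d", "k",
               "ɡ", "s", "z", "ʃ", "ʒ", " "]
char_to_id = {c: i for i, c in enumerate(IPA_SYMBOLS)}

def encode_transcript(text):
    """Map an IPA transcript to the integer IDs used as CTC labels."""
    return [char_to_id[c] for c in text]

def ctc_loss(y_true, y_pred, input_length, label_length):
    """CTC loss over per-frame softmax outputs. y_pred must have
    len(IPA_SYMBOLS) + 1 classes; the last one is the CTC blank."""
    return tf.keras.backend.ctc_batch_cost(
        y_true, y_pred, input_length, label_length)

print(encode_transcript("kæt"))  # [9, 1, 7] given the inventory above
```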

Both wideband spectral input and IPA character-level output can be used in tandem to provide the highest degree of specificity, which I hypothesize should yield the highest accuracy. However, this degree of specificity might prove detrimental to generalization of the model to instances outside of the trained cases.

Technologies Used

Intel distribution of Python 3.6, Intel distribution of TensorFlow 1.14, Intel DevCloud, Intel NUC with Windows 10.
