Speech Emotion Recognition

Kunal Gehlot

Jodhpur, Rajasthan


Speech Emotion Recognition based on Deep Learning using a TensorFlow CNN, which classifies audio recordings into seven emotions: Anger, Fear, Disgust, Happy, Sad, Surprised, and Neutral.

Project status: Published/In Market

Artificial Intelligence

Groups
Student Developers for AI


Overview / Usage

Speech Emotion Recognition is aimed at the service sector, where a customer representative can gauge the mood or emotion of the caller and use a predefined or otherwise appropriate approach to connect with them. It is currently used in call centres, where the representative can handle the customer accordingly.

Methodology / Approach

Databases used:
I used the SAVEE and RAVDESS databases:
· SAVEE: The Surrey Audio-Visual Expressed Emotion (SAVEE) database was recorded as a prerequisite for the development of an automatic emotion recognition system. It consists of recordings from 4 male actors in 7 different emotions, 480 British English utterances in total. The sentences were chosen from the standard TIMIT corpus and phonetically balanced for each emotion. The data were recorded in a visual media lab with high-quality audio-visual equipment, then processed and labelled.
· RAVDESS: The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) is a validated, multimodal database of emotional speech and song: a dynamic set of facial and vocal expressions in North American English. The database is gender-balanced, consisting of 24 professional actors vocalizing lexically matched statements in a neutral North American accent. Speech includes calm, happy, sad, angry, fearful, surprise, and disgust expressions, and song includes calm, happy, sad, angry, and fearful emotions. Each expression is produced at two levels of emotional intensity, with an additional neutral expression. All conditions are available in face-and-voice, face-only, and voice-only formats. Each of the 7356 recordings was rated 10 times on emotional validity, intensity, and genuineness.
Features extracted:
Features were extracted using the LibROSA Python library, which provides the building blocks necessary to create music information retrieval systems.
Mel-frequency cepstral coefficients (MFCCs)
Ever heard the word cepstral before? Probably not. It's spectral with the "spec" reversed! Why though? At a very basic level, the cepstrum captures information about the rate of change in spectral bands. In the conventional analysis of time signals, any periodic component (e.g., echoes) shows up as sharp peaks in the corresponding frequency spectrum (i.e., the Fourier spectrum, obtained by applying a Fourier transform to the time signal).
On taking the log of the magnitude of this Fourier spectrum, and then again taking the spectrum of this log by a cosine transformation (I know it sounds complicated, but bear with me please!), we observe a peak wherever there is a periodic element in the original time signal. Since we apply a transform on the frequency spectrum itself, the resulting spectrum is neither in the frequency domain nor in the time domain and hence Bogert et al. decided to call it the quefrency domain. And this spectrum of the log of the spectrum of the time signal was named cepstrum (ta-da!).
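As a hedged illustration (not code from this project), the cepstrum described above can be computed in a few lines of NumPy/SciPy: take the magnitude spectrum, take its log, and apply a cosine transform. The signal, sample rate, and echo delay below are made up purely for the example.

```python
import numpy as np
from scipy.fftpack import dct

# Illustrative only: a 1-second 440 Hz tone with a 10 ms echo added,
# so the cepstrum shows a clear peak from the periodic (echo) component.
sr = 16000
t = np.arange(sr) / sr
signal = np.sin(2 * np.pi * 440 * t)
signal = signal + 0.5 * np.roll(signal, sr // 100)   # 10 ms echo

spectrum = np.abs(np.fft.rfft(signal))               # Fourier magnitude spectrum
log_spectrum = np.log(spectrum + 1e-10)              # log of the magnitude
cepstrum = dct(log_spectrum, norm="ortho")           # cosine transform of the log spectrum

# Peaks in `cepstrum` (the "quefrency" domain) mark periodic structure
# in the original time signal, such as the added echo.
```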

Models used:
I first extracted MFCC (Mel-Frequency Cepstral Coefficient) features from the audio using LibROSA, with 39 Mel-frequency coefficients per audio clip, and then segregated the data into the seven emotions: Anger, Fear, Disgust, Happy, Sad, Surprised, and Neutral (a sketch of this step follows below).
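A minimal sketch of that extraction step with LibROSA, assuming WAV input; the file path is a placeholder and the exact preprocessing in the repository may differ.

```python
import numpy as np
import librosa

def extract_mfcc(path, n_mfcc=39):
    """Load one recording and return its time-averaged 39-coefficient MFCC vector."""
    y, sr = librosa.load(path, sr=None)                       # keep native sample rate
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)    # shape: (n_mfcc, n_frames)
    return np.mean(mfcc, axis=1)                              # one (39,) vector per clip

# Hypothetical usage; the path is a placeholder, not the dataset's actual layout.
features = extract_mfcc("data/sample_utterance.wav")
print(features.shape)  # (39,)
```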
Then I took two approaches. The first was classical machine learning, with the models and accuracies listed below (a training sketch follows the list):
· Random Forest: 38.89% accuracy
· KNeighborsClassifier (9 neighbors): 35.93% accuracy
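A hedged sketch of how those two baselines could be trained with scikit-learn, assuming `X` is the matrix of per-clip MFCC vectors and `y` the emotion labels; this is not the repository's actual training script.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def evaluate_baselines(X, y):
    """X: (n_clips, 39) MFCC feature matrix, y: emotion labels (7 classes)."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y)
    models = {
        "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
        "KNeighborsClassifier (9 neighbors)": KNeighborsClassifier(n_neighbors=9),
    }
    for name, model in models.items():
        model.fit(X_train, y_train)
        acc = accuracy_score(y_test, model.predict(X_test))
        print(f"{name}: {acc:.2%} accuracy")
```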
Concluding that classical machine learning would not work well here, I took the deep learning approach, using a 2-dimensional CNN (Convolutional Neural Network, Conv2D in Keras), and got my best result yet (an architecture sketch follows below):
· Conv2D: 86.43% accuracy
This makes Conv2D the most useful model for predicting emotion.
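A minimal Keras sketch of a Conv2D classifier of this kind, assuming the MFCC matrices are padded or trimmed to a fixed 39x100 shape and treated as single-channel images; the exact layers, shapes, and hyperparameters in the repository may differ.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn(input_shape=(39, 100, 1), n_classes=7):
    """2-D CNN over MFCC 'images': (coefficients, time frames, 1 channel) -> 7 emotions."""
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.3),
        layers.Dense(n_classes, activation="softmax"),   # one output per emotion
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_cnn()
model.summary()
```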
Testing:
To test the model, a separate program was written to record audio, and another to load the saved models and run the recorded sample against them (a sketch follows below).
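A sketch of that testing flow, assuming PyAudio for recording and a saved Keras model; the file names, clip length, input shape, and label order are placeholders rather than the repository's actual scripts.

```python
import wave
import numpy as np
import pyaudio
import librosa
from tensorflow.keras.models import load_model

# Label order is illustrative; it must match the order used during training.
EMOTIONS = ["Anger", "Fear", "Disgust", "Happy", "Sad", "Surprised", "Neutral"]

def record(path="sample.wav", seconds=3, rate=44100, chunk=1024):
    """Record a short clip from the default microphone to a WAV file."""
    pa = pyaudio.PyAudio()
    stream = pa.open(format=pyaudio.paInt16, channels=1, rate=rate,
                     input=True, frames_per_buffer=chunk)
    frames = [stream.read(chunk) for _ in range(int(rate / chunk * seconds))]
    sample_width = pa.get_sample_size(pyaudio.paInt16)
    stream.stop_stream()
    stream.close()
    pa.terminate()
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(sample_width)
        wf.setframerate(rate)
        wf.writeframes(b"".join(frames))
    return path

def predict(path, model_path="emotion_cnn.h5"):
    """Extract MFCCs from the recording and classify with the saved model.

    `emotion_cnn.h5` is a placeholder name for the saved Conv2D model.
    """
    y, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=39)
    mfcc = librosa.util.fix_length(mfcc, size=100, axis=1)   # pad/trim time axis
    x = mfcc[np.newaxis, ..., np.newaxis]                    # shape: (1, 39, 100, 1)
    model = load_model(model_path)
    return EMOTIONS[int(np.argmax(model.predict(x)))]

print(predict(record()))
```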

Technologies Used

  • LibROSA
  • PyAudio
  • TensorFlow
  • Keras
  • CNN
  • scikit-learn
  • NumPy
  • Pandas
  • Python 3
  • TensorFlow-GPU
  • NVIDIA GTX 1050 Ti (GPU)

Repository

https://github.com/KunalGehlot/Voice-Emotion-Recognition
