Monaural Audio Source Separation using Wave-U-Net & Deep Convolutional Neural Networks

Tanmay Bhagwat


Thane, Maharashtra

0 Collaborators

The project aims to develop a system capable of analysing a mastered album track and discerning the individual audio sources within it. The audio from each source is separated and provided as a separate output. The system uses the Wave-U-Net architecture, built on deep CNNs, and operates on the MUSDB song dataset.

Project status: Under Development

Artificial Intelligence, Graphics and Media

Groups
Student Developers for AI

Intel Technologies
Intel Integrated Graphics


Overview / Usage

Audio source separation is an ongoing research topic that deals with discerning the various sources of audio present in a sample. It has major applications in the investigation, authentication and entertainment sectors. Our project targets the entertainment sector, specifically music. A mastered studio recording is usually a blend of multiple audio tracks from vocals and instruments. The project separates each audio source using Wave-U-Net, a deep convolutional architecture, so the user obtains a separate audio clip for each instrument and for the vocals.

These separated results over an audio dataset can be clustered to identify acoustically similar songs, i.e. songs that sound alike and use similar instruments, so that songs with similar moods and rhythms are grouped together. Currently, audio giants like Spotify and iTunes use the genre, artist and album metadata of a song to recommend music, whereas our underlying system aims to recommend songs through deep analysis of the audio clip itself, which should considerably improve recommendation accuracy. Better recommendations, in turn, lead to higher user engagement and hence higher revenue.
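As a rough sketch of the clustering step described above, and assuming each song has already been summarised as a fixed-length feature vector derived from its separated stems, K-Means or DBSCAN can group acoustically similar songs. The feature matrix below is randomly generated purely for illustration, and scikit-learn is used here only as an example library (it is not part of the project's listed technologies).

    # Illustrative sketch only: assumes each song is already summarised as a
    # fixed-length feature vector computed from its separated stems.
    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import KMeans, DBSCAN

    # Hypothetical feature matrix: one row per song, one column per feature
    # (e.g. per-stem spectral centroid, energy, tempo); values are random here.
    rng = np.random.default_rng(0)
    song_features = rng.normal(size=(100, 12))

    # Standardise features so no single feature dominates the distance metric.
    X = StandardScaler().fit_transform(song_features)

    # K-Means assigns every song to one of a fixed number of clusters.
    kmeans_labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)

    # DBSCAN instead finds dense groups and marks outliers with the label -1.
    dbscan_labels = DBSCAN(eps=1.5, min_samples=5).fit_predict(X)

    print(kmeans_labels[:10], dbscan_labels[:10])

Songs sharing a cluster label would then be treated as candidates for recommendation to the same listener.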

Methodology / Approach

Audio source separation has generally been carried out by analysing spectrograms of the audio signal. Such spectrogram-based systems use the magnitude information obtained from the Short-Time Fourier Transform (STFT) to discern the various audio sources, but the phase information is discarded entirely. The phase of the STFT carries the alignment information across overlapping analysis frames, which is essential for an accurate reconstruction of each source. The methodology we are using, Wave-U-Net, operates directly on the raw time-domain waveform, so the phase information is retained and the results improve. Deep convolutional networks, in combination with mel filterbank features, can learn to generate the source waveforms and spectrogram estimates by training over the entire audio dataset. Hence, we obtain a waveform for each source in the audio, which can in turn be labelled with classification techniques such as naive Bayes or Support Vector Machines. The separated results for each sample can then be clustered using DBSCAN or K-Means to obtain reliable groups of acoustically similar audio.
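A minimal sketch of the Wave-U-Net idea in Keras (the project lists TensorFlow and Keras): a 1-D convolutional encoder/decoder over the raw waveform, with skip connections between matching time resolutions and one output channel per separated source. The depth, filter counts, kernel sizes, input length and loss below are illustrative assumptions and not the configuration of the linked repository.

    # Hedged sketch of a Wave-U-Net-style 1-D encoder/decoder; hyperparameters
    # are assumptions for illustration, not the repository's settings.
    import tensorflow as tf
    from tensorflow.keras import layers, Model

    def build_wave_u_net(input_len=16384, num_sources=2, depth=4, base_filters=16):
        inp = layers.Input(shape=(input_len, 1))            # raw mono waveform
        skips, x = [], inp

        # Downsampling path: convolution, store skip, then halve time resolution.
        for d in range(depth):
            x = layers.Conv1D(base_filters * (d + 1), 15, padding="same",
                              activation=tf.nn.leaky_relu)(x)
            skips.append(x)
            x = layers.MaxPooling1D(2)(x)

        # Bottleneck at the coarsest time resolution.
        x = layers.Conv1D(base_filters * (depth + 1), 15, padding="same",
                          activation=tf.nn.leaky_relu)(x)

        # Upsampling path: upsample, concatenate the matching skip, convolve.
        for d in reversed(range(depth)):
            x = layers.UpSampling1D(2)(x)
            x = layers.Concatenate()([x, skips[d]])
            x = layers.Conv1D(base_filters * (d + 1), 5, padding="same",
                              activation=tf.nn.leaky_relu)(x)

        # One waveform channel per separated source, bounded to [-1, 1].
        out = layers.Conv1D(num_sources, 1, activation="tanh")(x)
        return Model(inp, out)

    model = build_wave_u_net()
    model.compile(optimizer="adam", loss="mse")              # waveform regression loss
    model.summary()

In training, fixed-length waveform segments cut from the MUSDB mixtures would serve as inputs, with the corresponding vocal and accompaniment stems as regression targets.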

Technologies Used

  • Wave-U-Net
  • Python 2.x
  • CUDA (GPU cores)
  • Spyder (Anaconda Navigator, Python 2.x)
  • TensorFlow (Theano)
  • Keras
  • Librosa (music library for Python; see the sketch after this list)
  • Mel Filterbank
  • Google Colab (testing)
  • MUSDB Sampled Audio Dataset
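Since Librosa and a mel filterbank appear in the list above, the following is a small, hedged example of the feature-extraction step they enable; the file name is a placeholder and the parameters are illustrative assumptions.

    # Hedged example: mel-spectrogram features with Librosa; the file path is a
    # placeholder, not part of the project.
    import numpy as np
    import librosa

    y, sr = librosa.load("example_track.wav", sr=22050, mono=True)

    # STFT magnitude projected onto a mel filterbank, then log-compressed.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048,
                                         hop_length=512, n_mels=128)
    log_mel = librosa.power_to_db(mel, ref=np.max)

    print(log_mel.shape)   # (n_mels, n_frames)

Features like these can feed the classification and clustering stages described in the methodology.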

Repository

https://github.com/frizzid07/Wave-U-Net
