10 Audio Processing Tasks to get you started with Deep Learning Applications (with Case Studies)
Introduction
Imagine a world where machines understand what you want and how you are feeling when you call
customer care: if you are unhappy about something, you speak to a person quickly; if you are looking for
specific information, you may not need to talk to a person at all (unless you want to!).
This is going to be the new order of the world, and you can already see it happening to a good degree.
Check out the highlights of 2017 in the data science industry and you will see the breakthroughs deep
learning has brought to problems that were previously difficult to solve. One field where deep learning has
great potential is audio/speech processing, especially given its unstructured nature and vast impact.
So for the curious ones out there, I have compiled a list of tasks that are worth getting your hands dirty
with when starting out in audio processing. I'm sure there are a few more breakthroughs to come in this
space using deep learning.
The article is structured to explain each task and its importance. For every task, there is also a research
paper that goes into the details, along with a case study to help you get started on solving it.
So let’s get cracking!
1. Audio Classification
Audio classification is a fundamental problem in the field of audio processing. The task is essentially to
extract features from the audio, and then identify which class the audio belongs to. Many useful
applications pertaining to audio classification can be found in the wild – such as genre classification,
instrument recognition and artist identification.
This is also the most explored topic in audio processing, with plenty of papers published in the field in the
last year alone. In fact, we have also hosted a practice hackathon where the community collaborated on
solving this particular task.
Whitepaper – http://ieeexplore.ieee.org/document/5664796/?reload=true
A common approach to an audio classification task is to pre-process the audio inputs to extract useful
features, and then apply a classification algorithm to them. For example, in the case study below we are
given a 5-second excerpt of a sound, and the task is to identify which class it belongs to – whether it is a
dog barking or a drilling sound. As mentioned in the article, one approach is to extract an audio feature
called MFCC (mel-frequency cepstral coefficients) and then pass it through a neural network to get the
appropriate class.
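To make the extract-features-then-classify pipeline concrete, here is a toy sketch of my own (not the case study's code): a plain averaged FFT spectrum stands in for MFCCs, a nearest-centroid rule stands in for the neural network, and synthetic tones stand in for real recordings.

```python
import numpy as np

def feature(signal, n_fft=256):
    """Average magnitude spectrum over short frames: a crude stand-in
    for MFCC features."""
    frames = [signal[i:i + n_fft] for i in range(0, len(signal) - n_fft, n_fft)]
    return np.abs(np.fft.rfft(frames, axis=1)).mean(axis=0)

def make_tone(freq, sr=8000, dur=0.5):
    t = np.arange(int(sr * dur)) / sr
    return np.sin(2 * np.pi * freq * t)

# Two synthetic "classes": low-pitched vs high-pitched sounds.
train = {"low": [make_tone(200), make_tone(220)],
         "high": [make_tone(2000), make_tone(2200)]}
centroids = {c: np.mean([feature(s) for s in clips], axis=0)
             for c, clips in train.items()}

def classify(signal):
    """Nearest-centroid classifier, standing in for a trained network."""
    f = feature(signal)
    return min(centroids, key=lambda c: np.linalg.norm(f - centroids[c]))

print(classify(make_tone(210)))   # -> low
print(classify(make_tone(2100)))  # -> high
```

In a real pipeline, MFCCs and a trained neural network would replace the FFT average and the centroid rule, but the shape of the solution – features in, class label out – is the same.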
Case Study – https://www.analyticsvidhya.com/blog/2017/08/audio-voice-processing-deep-learning/
2. Audio Fingerprinting
The aim of audio fingerprinting is to determine a compact digital "summary" of a piece of audio, which can
then be used to identify it from a short sample. Shazam is an excellent example of audio fingerprinting in
action: it recognises music on the basis of the first two to five seconds of a song. However, there are still
situations where such systems fail, especially where there is a high amount of background noise.
Whitepaper – http://www.cs.toronto.edu/~dross/ChandrasekharSharifiRoss_ISMIR2011.pdf
One approach to this problem is to represent the audio in a form that is easier to decipher, and then find
the patterns that differentiate the audio from the background noise. In the case study below, the author
converts raw audio to spectrograms and then uses peak-finding and fingerprint-hashing algorithms to
define the fingerprints of that audio file.
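The spectrogram-peaks-and-hashes idea can be sketched in a few lines. This is a deliberately simplified illustration of my own (one landmark per frame, consecutive landmark pairs hashed into tokens), not the case study's actual algorithm, and the "songs" are synthetic tones:

```python
import hashlib
import numpy as np

def spectrogram(signal, n_fft=256, hop=128):
    frames = [signal[i:i + n_fft] for i in range(0, len(signal) - n_fft, hop)]
    return np.abs(np.fft.rfft(frames, axis=1))

def fingerprint(signal):
    spec = spectrogram(signal)
    peaks = spec.argmax(axis=1)     # one landmark (strongest bin) per frame
    # Hash consecutive landmark pairs into compact fingerprint tokens.
    return {hashlib.sha1(f"{a}-{b}".encode()).hexdigest()[:10]
            for a, b in zip(peaks, peaks[1:])}

rng = np.random.default_rng(0)
sr = 8000
t = np.arange(2 * sr) / sr
song = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 880 * t)
decoy = np.sin(2 * np.pi * 1500 * t)
db = {"song_a": fingerprint(song), "song_b": fingerprint(decoy)}

# A short, noisy excerpt should still match the right song.
clip = song[4000:12000] + 0.1 * rng.normal(size=8000)
query = fingerprint(clip)
match = max(db, key=lambda name: len(db[name] & query))
print(match)   # -> song_a
```

Because the tokens depend only on the strongest spectral peaks, the noisy query still shares tokens with the original recording, which is what makes Shazam-style lookup robust.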
Case Study – http://willdrevo.com/fingerprinting-and-audio-recognition-with-python/
3. Automatic Music Tagging
Music tagging is a more complex version of audio classification: here, each audio clip may belong to
multiple classes at once, i.e. it is a multi-label classification problem. A potential application of this task is
creating metadata for the audio so that it can be searched later on. Deep learning has helped solve this
task to a certain extent, as can be seen in the case study below.
Whitepaper – https://link.springer.com/article/10.1007/s10462-012-9362-y
As with most of these tasks, the first step is to extract features from the audio sample. These are then
mapped to tags according to the nuances of the audio (for example, if the audio contains more
instrumental sound than singing, the tag could be "instrumental"). This can be done with either classical
machine learning or deep learning methods. The case study mentioned below uses deep learning,
specifically a convolutional recurrent neural network operating on mel-frequency features.
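The multi-label part is worth seeing in isolation. In the sketch below (my own illustration with hypothetical hand-set weights, standing in for the final layer of a trained tagging network), each tag gets an independent sigmoid score and every tag above a threshold is kept – unlike softmax classification, which picks exactly one class:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

tags = ["rock", "instrumental", "vocal", "slow"]

# Hypothetical weights standing in for a trained network's output layer.
W = np.array([[ 2.0,  0.0],
              [-2.0,  0.0],
              [ 0.0,  2.0],
              [ 0.0, -2.0]])

def predict_tags(features, threshold=0.5):
    # One independent sigmoid score per tag; keep all tags above threshold.
    scores = sigmoid(W @ features)
    return [t for t, s in zip(tags, scores) if s > threshold]

clip = np.array([1.0, 1.0])      # stand-in for learned audio features
print(predict_tags(clip))         # -> ['rock', 'vocal']
```

Training such a head uses a per-tag binary cross-entropy loss rather than a single categorical loss, which is the key modelling difference from plain audio classification.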
Case Study – https://github.com/keunwoochoi/music-auto_tagging-keras
4. Audio Segmentation
Segmentation literally means dividing a particular object into parts (or segments) based on a defined set of
characteristics. For audio data analysis in particular, it is an important pre-processing step, because it lets
us break a noisy and lengthy audio signal into short, homogeneous segments that are easier to process
further. One application of the task is heart sound segmentation, i.e. identifying the sounds specific to the
heart.
Whitepaper – http://www.mecs-press.org/ijitcs/ijitcs-v6-n11/IJITCS-V6-N11-1.pdf
We can convert this into a supervised learning problem, where each timestamp is classified on the basis of
the segments required, and then apply an audio classification approach. In the case study below, the task
is to segment the heart sound into two segments (lub and dub) so that an anomaly can be identified in
each segment. This can be solved by extracting audio features and then applying deep learning for
classification.
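A minimal sketch of the frame-by-frame idea, assuming nothing from the case study: each short frame of a synthetic "heartbeat" signal is labelled by its energy (standing in for a learned classifier), and consecutive frames with the same label are merged into segments.

```python
import numpy as np

def segment(signal, sr, frame_len=0.05, threshold=0.01):
    """Label each frame 'sound' or 'silence' by short-time energy, then
    merge runs of identical labels into (label, start_sec, end_sec)."""
    n = int(sr * frame_len)
    labels = []
    for i in range(0, len(signal) - n + 1, n):
        energy = np.mean(signal[i:i + n] ** 2)
        labels.append("sound" if energy > threshold else "silence")
    segments, start = [], 0
    for j in range(1, len(labels) + 1):
        if j == len(labels) or labels[j] != labels[start]:
            segments.append((labels[start], start * frame_len, j * frame_len))
            start = j
    return segments

# Synthetic "heartbeat": 0.2 s low-frequency bursts separated by silence.
sr = 8000
t = np.arange(int(0.2 * sr)) / sr
beat = np.sin(2 * np.pi * 100 * t)
quiet = np.zeros(int(0.2 * sr))
signal = np.concatenate([quiet, beat, quiet, beat])
print(segment(signal, sr))
```

In the real task the per-frame decision comes from a trained model on extracted features rather than a raw energy threshold, but the merge-frames-into-segments structure is the same.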
Case Study – https://www.analyticsvidhya.com/blog/2017/11/heart-sound-segmentation-deep-learning/
5. Audio Source Separation
Audio source separation consists of isolating one or more source signals from a mixture of signals. One of
the most common applications is isolating the vocal track from a song, for example to follow or translate
the lyrics (karaoke, for instance). A classic example is shown in Andrew Ng's machine learning course,
where he separates the sound of the speaker from the background music.
Whitepaper – http://ijcert.org/ems/ijcert_papers/V3I1103.pdf
A typical usage scenario involves:
loading an audio file
computing a time-frequency transform to obtain a spectrogram, and
using a source separation algorithm (such as non-negative matrix factorization) to obtain a
time-frequency mask
The mask is then multiplied with the spectrogram and the result is converted back to the time domain.
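The steps above can be sketched end to end with a hand-rolled NMF. This is my own toy illustration, not the untwist library's API: the time-frequency transform is a no-overlap FFT, and the two "sources" are synthetic tones switched on and off with different rhythms so that NMF has temporal structure to exploit.

```python
import numpy as np

def stft(x, n_fft=256):
    frames = x[:len(x) // n_fft * n_fft].reshape(-1, n_fft)
    return np.fft.rfft(frames, axis=1)           # frames x bins, complex

def istft(spec, n_fft=256):
    return np.fft.irfft(spec, n=n_fft, axis=1).ravel()

def nmf(V, k, iters=200, seed=0):
    """Non-negative matrix factorisation V ~ W @ H, multiplicative updates."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], k)) + 0.1
    H = rng.random((k, V.shape[1])) + 0.1
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + 1e-9)
        W *= (V @ H.T) / (W @ H @ H.T + 1e-9)
    return W, H

# Toy mixture: two tones with different on/off rhythms.
sr = 8000
t = np.arange(2 * sr) / sr
voice = (np.sin(2 * np.pi * 2 * t) > 0) * np.sin(2 * np.pi * 300 * t)
music = (np.cos(2 * np.pi * 3 * t) > 0) * np.sin(2 * np.pi * 1200 * t)
mix = voice + music

S = stft(mix)                     # time-frequency transform
V = np.abs(S).T                   # magnitude spectrogram (freq x time)
W, H = nmf(V, k=2)                # one NMF component per source
sources = []
for i in range(2):
    # Soft time-frequency mask for component i, applied to the complex
    # STFT, then converted back to the time domain.
    mask = np.outer(W[:, i], H[i]) / (W @ H + 1e-9)
    sources.append(istft(mask.T * S))
```

Each recovered source should be dominated by one of the two tones; a production system would use an overlapping, windowed STFT and a more robust separation model.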
Case Study – https://github.com/IoSR-Surrey/untwist
6. Beat Tracking
As the name suggests, the goal here is to track the location of each beat in a collection of audio files. Beat
tracking can be utilized to automate time-consuming tasks that must be completed in order to synchronize
events with music. It is useful in various applications, such as video editing, audio editing, and human-
computer improvisation.
Whitepaper – https://www.audiolabs-erlangen.de/content/05-fau/professor/00-mueller/01-
students/2012_GroschePeter_MusicSignalProcessing_PhD-Thesis.pdf
One approach to beat tracking is to parse the audio file and use an onset detection algorithm to track the
beats. Although the techniques used for onset detection rely heavily on audio feature engineering and
machine learning, deep learning can also be used here to improve the results.
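As a hedged, numpy-only sketch of that idea (mine, not BTrack's algorithm): compute frame energies, take the positive energy changes as an onset-strength signal, and pick its peaks as beats on a synthetic click track.

```python
import numpy as np

def beat_times(signal, sr, frame=256):
    """Energy-novelty beat tracker: frame energies, positive changes as
    onset strength, then simple thresholded peak picking."""
    n_frames = len(signal) // frame
    energy = (signal[:n_frames * frame].reshape(-1, frame) ** 2).sum(axis=1)
    novelty = np.maximum(0, np.diff(energy))
    thresh = 0.3 * novelty.max()
    beats = [i for i in range(1, len(novelty) - 1)
             if novelty[i] > thresh
             and novelty[i] >= novelty[i - 1]
             and novelty[i] >= novelty[i + 1]]
    # novelty[i] measures the rise into frame i + 1.
    return [(b + 1) * frame / sr for b in beats]

# A synthetic click track: 25 ms bursts every 0.5 s (120 BPM).
sr = 8000
signal = np.zeros(4 * sr)
for k in range(1, 8):
    start = int(k * 0.5 * sr)
    signal[start:start + 200] = np.sin(2 * np.pi * 1000 * np.arange(200) / sr)

print([round(b, 2) for b in beat_times(signal, sr)])
```

Real beat trackers add tempo estimation and dynamic programming on top of the onset-strength signal so that beats stay evenly spaced even when individual onsets are weak.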
Case Study – https://github.com/adamstark/BTrack
7. Music Recommendation
Thanks to the internet, we now have millions of songs we can listen to anytime. Ironically, this has made it
even harder to discover new music because of the plethora of options out there. Music recommendation
systems help deal with this information overload by automatically recommending new music to listeners.
Content providers like Spotify and Saavn have developed highly sophisticated music recommendation
engines. These models leverage the user’s past listening history among many other features to build
customized recommendation lists.
Whitepaper – https://pdfs.semanticscholar.org/7442/c1ebd6c9ceafa8979f683c5b1584d659b728.pdf
We can tackle the challenge of modelling listening preferences by training a regression or deep learning
model to predict the latent representations of songs obtained from a collaborative filtering model. This
way, we can predict the representation of a song in the collaborative filtering space even if no usage data
is available.
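A compact sketch of that cold-start trick, on entirely simulated data (plain least squares stands in for the deep regression model, and random vectors stand in for real audio features and collaborative-filtering factors):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated setup: 50 songs with listening data. Collaborative filtering
# has assigned each song a latent vector; we also have audio features.
n_songs, n_feat, n_latent = 50, 10, 3
audio = rng.normal(size=(n_songs, n_feat))
true_map = rng.normal(size=(n_feat, n_latent))
latent = audio @ true_map + 0.01 * rng.normal(size=(n_songs, n_latent))

# Learn the audio -> latent mapping (least squares standing in for a
# deep network trained on the same regression target).
M, *_ = np.linalg.lstsq(audio, latent, rcond=None)

# A brand-new song with no listening history: predict its position in
# the collaborative-filtering space from audio alone.
new_song = rng.normal(size=n_feat)
new_latent = new_song @ M

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Recommend the catalogue song closest to it in latent space.
best = max(range(n_songs), key=lambda i: cosine(latent[i], new_latent))
print("most similar song:", best)
```

The point of the design is exactly the cold-start case: a song nobody has played yet still lands somewhere sensible in the collaborative-filtering space.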
Case Study – http://benanne.github.io/2014/08/05/spotify-cnns.html
8. Music Retrieval
One of the most difficult tasks in audio processing, music retrieval essentially aims to build a search
engine based on audio. Although we can approach this by solving sub-tasks like audio fingerprinting, the
task encompasses much more than that. For example, different types of music retrieval require solving
different smaller tasks (timbre detection, say, for identifying a singer's gender). Currently, no system has
been developed that matches industry-expected standards.
Whitepaper – http://www.nowpublishers.com/article/Details/INR-042
The task of music retrieval is usually divided into smaller and simpler steps, which include tonal analysis
(e.g. melody and harmony) and rhythm or tempo analysis (e.g. beat tracking). On the basis of these
individual analyses, information is extracted and used to retrieve similar audio samples.
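A toy sketch of the retrieve-by-similarity idea (my own illustration, with a crude spectral descriptor standing in for proper tonal and rhythmic analysis, and synthetic tones standing in for a music catalogue):

```python
import numpy as np

def descriptor(signal, n_fft=512):
    """A tiny audio descriptor: the normalised average magnitude
    spectrum, a crude stand-in for tonal/rhythmic analysis."""
    frames = signal[:len(signal) // n_fft * n_fft].reshape(-1, n_fft)
    spec = np.abs(np.fft.rfft(frames, axis=1)).mean(axis=0)
    return spec / np.linalg.norm(spec)

def retrieve(query, catalogue):
    """Return the catalogue entry most similar to the query clip."""
    q = descriptor(query)
    scores = {name: float(q @ descriptor(clip))
              for name, clip in catalogue.items()}
    return max(scores, key=scores.get)

sr = 8000
t = np.arange(2 * sr) / sr
catalogue = {
    "tone_low": np.sin(2 * np.pi * 220 * t),
    "tone_mid": np.sin(2 * np.pi * 440 * t),
    "tone_high": np.sin(2 * np.pi * 1760 * t),
}
query = np.sin(2 * np.pi * 230 * t)   # slightly detuned low tone
print(retrieve(query, catalogue))      # -> tone_low
```

A real retrieval system replaces the single descriptor with several (melody, harmony, tempo, timbre) and indexes them so similarity search scales to millions of tracks.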
Case Study – https://youtu.be/oGGVvTgHMHw
9. Music Transcription
Music transcription is another challenging audio processing task. It involves annotating audio and creating
a kind of "sheet" from which the music can be regenerated at a later point in time. The manual effort
involved in transcribing music from recordings can be vast; it varies enormously depending on the
complexity of the music, how good our listening skills are and how detailed we want our transcription to
be.
Whitepaper – http://ieeexplore.ieee.org/abstract/document/7955698
The approach to music transcription is similar to that of speech recognition, except that instead of words,
the audio is transcribed into the musical notes played by the instruments.
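The core building block, pitch-to-note mapping, can be sketched without any transcription framework. Below is my own minimal illustration: autocorrelation estimates the fundamental frequency of a synthetic tone, which is then mapped to the nearest note name (real transcription systems additionally handle polyphony, timing and dynamics):

```python
import numpy as np

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def pitch_hz(signal, sr, fmin=80, fmax=1000):
    """Estimate the fundamental frequency via autocorrelation."""
    corr = np.correlate(signal, signal, mode="full")[len(signal) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)
    lag = lo + np.argmax(corr[lo:hi])     # strongest repetition period
    return sr / lag

def hz_to_note(freq):
    """Map a frequency to the nearest note name (A4 = 440 Hz, MIDI 69)."""
    midi = int(round(69 + 12 * np.log2(freq / 440.0)))
    return NOTE_NAMES[midi % 12] + str(midi // 12 - 1)

sr = 8000
t = np.arange(int(0.25 * sr)) / sr
note = np.sin(2 * np.pi * 440 * t)        # a quarter second of A4
print(hz_to_note(pitch_hz(note, sr)))      # -> A4
```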
Case Study – https://youtu.be/9boJ-Ai6QFM
10. Onset Detection
Onset detection is the first step in analysing an audio/music sequence. For most of the tasks mentioned
above, it is necessary to perform onset detection first, i.e. to detect the start of an audio event. It was also
essentially the first task that researchers set out to solve in audio processing.
Whitepaper – http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.332.989&rep=rep1&type=pdf
Onset detection is typically done by:
computing a spectral novelty function
finding peaks in the spectral novelty function
backtracking from each peak to a preceding local minimum. Backtracking can be useful for finding
segmentation points such that the onset occurs shortly after the beginning of the segment
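The three steps above can be sketched with plain numpy. This is a toy illustration of my own rather than any library's implementation, run on a synthetic signal with two abrupt tone entries:

```python
import numpy as np

def onsets(signal, n_fft=256):
    """Spectral-flux onset detector following the three steps above."""
    frames = signal[:len(signal) // n_fft * n_fft].reshape(-1, n_fft)
    spec = np.abs(np.fft.rfft(frames, axis=1))
    # 1. Spectral novelty: sum of positive spectral change per frame step.
    flux = np.maximum(0, np.diff(spec, axis=0)).sum(axis=1)
    # 2. Peak picking: local maxima above a fraction of the global maximum.
    peaks = [i for i in range(1, len(flux) - 1)
             if flux[i] > 0.3 * flux.max()
             and flux[i] >= flux[i - 1] and flux[i] >= flux[i + 1]]
    # 3. Backtracking: walk left from each peak to the preceding local
    #    minimum, a convenient segmentation point just before the onset.
    result = []
    for p in peaks:
        while p > 0 and flux[p - 1] < flux[p]:
            p -= 1
        result.append(p)
    return result

# Two events: a 500 Hz tone entering at sample 2000 and a 1250 Hz tone
# joining at sample 6000 (both frequencies frame-aligned, so steady
# frames produce zero flux).
sr = 8000
signal = np.zeros(sr)
signal[2000:] += np.sin(2 * np.pi * 500 * np.arange(2000, sr) / sr)
signal[6000:] += np.sin(2 * np.pi * 1250 * np.arange(6000, sr) / sr)
print(onsets(signal))   # frame indices just before the two onsets
```

Production implementations compute the novelty on a windowed, log-compressed spectrogram and use adaptive thresholds, but the novelty/peak-pick/backtrack skeleton is the same.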
Case Study – https://musicinformationretrieval.com/onset_detection.html
End Notes
In this article, I have mentioned a few tasks that can be looked at when solving audio processing
problems. I hope you find the article insightful in dealing with audio/speech related projects.
Article Url - https://www.analyticsvidhya.com/blog/2018/01/10-audio-processing-projects-applications/
Faizan Shaikh
Faizan is a Data Science enthusiast and a Deep learning rookie. A recent Comp. Sc. undergrad, he aims
to utilize his skills to push the boundaries of AI research.