An Approach for Acoustic Scene Classification
Prabharoop C C1,†
1 Govt. Engineering College Sreekrishnapuram, Dept. of ECE
ABSTRACT
In this work, we propose an approach for acoustic scene classification. The proposed method uses a pre-trained convolutional neural network (CNN) as a feature extractor. Random projections are then used to compress the resulting feature vectors. Class-specific archetypal analysis is employed on the compressed features for acoustic modeling, yielding a convex-sparse representation. This representation efficiently captures class-specific discriminative information.
Background & Summary
The goal of acoustic scene classification is to assign a test recording to one of a set of predefined classes, enabling machines to make sense of their environment through the analysis of sound; this is the central objective of machine listening. In this study, we approach the problem with a different workflow: modern convolutional neural network (CNN) models are used to extract features, and conventional dictionary learning procedures are used to learn "atoms", a sparse representation of the audio data, which in turn can be used to build a better performing classifier.
The work presented here is inspired by Thakur et al.1, who identified bird species via a multi-layer alternating sparse-dense framework. A large body of work has emerged in acoustic scene classification and machine listening, but the majority of it is based on fixed features such as spectrogram representations. In this work, we apply a more flexible, data-driven feature extractor to represent and classify acoustic scene data.
Convolutional neural networks can be considered one of the many revolutionary ideas that emerged with Industry 4.0. CNNs are like ordinary deep neural networks, except that they can transform an input image volume into an output volume of class scores with far fewer pre-processing steps than hand-engineered feature extractors. CNNs were originally designed for image processing tasks, but developments in 1-D and 3-D convolutional layers extended ConvNets to domains such as audio processing and medical image diagnostics (3-D MRI, CT, etc.).
As mentioned previously, CNNs can be applied to audio and speech processing tasks thanks to the growth of the 1-D convolution concept. One of the major works applying ConvNets to acoustic scene analysis is "SoundNet: Learning sound representations from unlabeled video" by Yusuf Aytar, Carl Vondrick, and Antonio Torralba2. In the SoundNet architecture, a student-teacher training procedure transfers discriminative visual knowledge from well established visual recognition models into the sound modality, using unlabeled video as a bridge. In this study, we use SoundNet-derived features to obtain audio representations, which lets us gather those representations in a much more flexible, data-driven manner.
Methods
In the proposed methodology, we use a pre-trained SoundNet model to extract audio representations. SoundNet is an eight-layer convolutional neural network trained with the student-teacher procedure described above, distilling well established visual recognition results. In this study, we used the features obtained from the fifth convolutional layer (conv5) of the network. The fifth layer was chosen for representational convenience and based on empirical results.
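For concreteness, a minimal sketch of this feature-extraction step is given below. It assumes a PyTorch port of SoundNet exposing a submodule named conv5; the checkpoint file name and the loading call are placeholders, not part of the original SoundNet release.

```python
import torch

# Assumption: a PyTorch port of the 8-layer SoundNet saved as a full
# model object with a submodule named "conv5"; both are hypothetical.
soundnet = torch.load("soundnet8.pt")
soundnet.eval()

features = {}

def save_conv5(module, inputs, output):
    # Detach so the stored activations do not keep the autograd graph alive.
    features["conv5"] = output.detach()

soundnet.conv5.register_forward_hook(save_conv5)

# Raw audio as a (batch, channels, samples) tensor: 30 s at 44.1 kHz.
waveform = torch.randn(1, 1, 30 * 44100)
with torch.no_grad():
    soundnet(waveform)

conv5_feats = features["conv5"]  # (batch, channels, time) feature map
```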
The features obtained from the fifth layer of the SoundNet architecture were then subjected to class-wise concatenation, which produced a very high-dimensional output. Processing this high-dimensional array was computationally expensive. To tackle this, we employed random-projection-based dimensionality reduction, following the Johnson-Lindenstrauss lemma. For computational convenience, we reduced the dimensionality by projecting the original input space with a sparse random matrix. Sparse random matrices are an alternative to dense Gaussian random projection matrices that guarantee similar embedding quality while being much more memory efficient and allowing faster computation of the projected data.
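A minimal sketch of this step using scikit-learn's SparseRandomProjection follows; the data shape, target distortion eps, and random seed are illustrative assumptions rather than values from this study.

```python
import numpy as np
from sklearn.random_projection import (
    SparseRandomProjection,
    johnson_lindenstrauss_min_dim,
)

# X: class-wise concatenated conv5 features, one row per segment.
# The shape here is illustrative only.
X = np.random.randn(2000, 18432)

# Minimum target dimension that preserves pairwise distances within
# a factor (1 +/- eps), per the Johnson-Lindenstrauss lemma.
k = johnson_lindenstrauss_min_dim(n_samples=X.shape[0], eps=0.25)

# Project with a sparse random matrix instead of a dense Gaussian one.
srp = SparseRandomProjection(n_components=k, random_state=0)
X_low = srp.fit_transform(X)
print(X_low.shape)
```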
A detailed illustration of how the SoundNet-derived features embed the content of the input audio is provided in Figures 1 and 2. For representational convenience, the plots shown are from "Urban" class audios in the "Making Sense of Sounds" challenge; this dataset was chosen for the interpretability of its individual audio files. For the actual experiments, we used the much more challenging and noisy LITIS Rouen audio scene dataset.
After concatenation and dimensionality reduction, we employed convex-hull-based methods to learn dictionaries from the corresponding input data classes. Specifically, we used archetypal-analysis-based methods to learn a dictionary for each class in our dataset. Archetypal analysis aims to represent the observations in a multivariate data set as convex combinations of extremal points. An archetype is the original pattern or model of which all things of the same type are representations or copies; archetypal analysis seeks these "pure types", the archetypes, within a set defined in a specific context. Concretely, the problem is to find a few, not necessarily observed, points (archetypes) in a set of multivariate observations such that all the data can be well represented as convex combinations of those archetypes. In this work, we used the methodology presented by Chen et al.3.
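The methodology of Chen et al.3 is implemented in the SPAMS toolbox; a minimal sketch of per-class dictionary learning with it is given below, under the assumption that spams.archetypalAnalysis follows the interface described in the SPAMS documentation. Shapes and the atom count are illustrative.

```python
import numpy as np
import spams  # SPAMS toolbox; ships the solver of Chen et al. [3]

def learn_dictionary(class_features, n_atoms=64):
    """Learn one archetypal dictionary per class and stack them.

    class_features: dict mapping class label -> (dim, n_samples) array
    whose columns are projected conv5 feature vectors of that class.
    """
    blocks = []
    for label in sorted(class_features):
        # SPAMS expects Fortran-ordered float64 with samples as columns.
        X = np.asfortranarray(class_features[label], dtype=np.float64)
        Z = spams.archetypalAnalysis(X, p=n_atoms)  # (dim, n_atoms) archetypes
        blocks.append(np.asarray(Z))
    # Final dictionary D: all class-specific dictionaries side by side.
    return np.hstack(blocks)
```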
Using archetypal analysis, an individual dictionary was learned for each class, and the final dictionary was obtained by combining all of the individual class-based dictionaries. Archetypal analysis can be seen as a non-negative factorization technique of the form X = DA, where X is the input feature matrix, D is the learned dictionary, and A holds the activations, i.e., the convex-sparse representations. The activation for any input feature vector is obtained by projecting the feature vector onto the simplex corresponding to the dictionary D. This activation representation carries strong class-specific signatures that can be used as the feature representation for our task. In this work, we use the active-set algorithm specified in the same paper3 to obtain these class-specific activations; representational diagrams for the same are attached in Figure 3.
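A sketch of this activation step is shown below, again assuming the SPAMS interface; spams.decompSimplex solves the simplex-constrained least-squares problem for every column of X, which is exactly the projection onto the simplex of D described above.

```python
import numpy as np
import spams

def simplex_activations(X, D):
    """Activations A in X ~ D A, with each column of A on the simplex.

    For every column x of X this solves
        min_a ||x - D a||^2  s.t.  a >= 0, sum(a) = 1,
    i.e., the projection of x onto the simplex spanned by D.
    """
    X = np.asfortranarray(X, dtype=np.float64)
    D = np.asfortranarray(D, dtype=np.float64)
    A = spams.decompSimplex(X, D)  # sparse (n_atoms_total, n_samples)
    return A.toarray()
```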
In this work, we used the activations described above as features to represent the input audio data. A linear SVM with tuned hyperparameters was chosen for classification.
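A minimal sketch of the classification stage with scikit-learn is given below; the hyperparameter grid, the data shapes, and the 19-class label set (matching the LITIS Rouen dataset) are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import LinearSVC

# Placeholders: activations (n_samples, n_atoms_total) and segment labels.
# In the actual pipeline these come from the simplex projection above.
A = np.abs(np.random.randn(1000, 64 * 19))
y = np.random.randint(0, 19, size=1000)

X_train, X_test, y_train, y_test = train_test_split(
    A, y, test_size=0.2, stratify=y, random_state=0
)

# Exhaustive search over the regularization strength of a linear-kernel SVM.
grid = GridSearchCV(
    LinearSVC(),  # one-vs-rest by default
    param_grid={"C": [0.01, 0.1, 1.0, 10.0, 100.0]},
    cv=5,
)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.score(X_test, y_test))
```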
Results
In this work, a subset with an equal number of audio files per class from the LITIS Rouen audio scene dataset was used. Each audio file was 30 s long and sampled at 44.1 kHz. The conv5 layer of the SoundNet architecture was used to obtain the feature representations, and class-specific features were then concatenated. After concatenation, the features were subjected to sparse random projection for dimensionality reduction. Archetypal analysis was then employed to learn class-specific dictionaries with 64 atoms per dictionary; the value of 64 was chosen empirically. The main dictionary was formed by joining the individual class-specific dictionaries together. The activation for any input feature vector was obtained by projecting the feature vector onto the simplex corresponding to the dictionary D. The activations obtained from this projection were labeled segment by segment and passed to the classifier. For classification, a support vector machine with a linear kernel was used, and its hyperparameters were tuned via an exhaustive search over a set of parameters.
The classification model achieved a test-set accuracy of 95.12%, an average precision of 97%, an average recall of 95%, and an average F1-score of 96%. A one-vs-rest strategy was used during training. The confusion matrix of the classification task is attached as Figure 4.
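These metrics can be computed with scikit-learn's standard utilities; the short sketch below continues from the classification sketch above (grid, X_test, y_test).

```python
from sklearn.metrics import classification_report, confusion_matrix

y_pred = grid.predict(X_test)

# Per-class and averaged precision, recall, and F1, as reported.
print(classification_report(y_test, y_pred))

# Confusion matrix, as plotted in Figure 4.
print(confusion_matrix(y_test, y_pred))
```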
References
1. Thakur, A., Abrol, V., Sharma, P. & Rajan, P. Local compressed convex spectral embedding for bird species identification. [Link] (2018).
2. Aytar, Y., Vondrick, C. & Torralba, A. SoundNet: Learning sound representations from unlabeled video. In Advances in Neural Information Processing Systems (2016).
3. Chen, Y., Mairal, J. & Harchaoui, Z. Fast and robust archetypal analysis for representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2014).
Figures
In this section, representational figures and plots are included.
Figure 1. Plot of a sample audio file from the Urban class in the MSoS data challenge. The x axis represents time and the y axis represents amplitude.

Figure 2. Feature representation of the audio file shown in Figure 1. The plot shows the conv5 feature representation from the SoundNet architecture. We can see that the features are activated only in portions where audio is present; the x axis can be visualized as corresponding to the temporal axis.

Figure 3. Activations obtained via projection of a feature vector onto the simplex of the dictionary. Here we visualize the activations of the first class of feature vectors. We learned 64 atoms per class and, as the plot shows, only the initial portion of the main dictionary D is activated, corresponding to the 64 atoms of the first class. The x axis represents the coefficient index and the y axis the corresponding amplitude.

Figure 4. Confusion matrix obtained as a result of the classification.
To-Do
Even though the work and study is almost completed, we would like to add a few more extra points to the current study,
• To get a quantitative measure on how good the Soundnet architecture, on a standalone stage can perform on our present
task.
• Instead of concatenating the feature vectors, to use clustering methods and to see how well the results vary
• To add more metrics to the evaluation of the classifier.
• To see how the performance is varied if we change the parameters like number of dictionaries learned and the amount of
reduction of dimensions of the feature vector.