An Approach for Acoustic Scene Classification
Prabharoop C C1,†
1 Govt. Engineering College Sreekrishnapuram, Dept. of ECE
ABSTRACT
In this work, we propose an approach for acoustic scene classification. The proposed method uses a pre-trained convolutional neural network (CNN) as a feature extractor. Random projections are then used to compress the resulting feature vectors. Class-specific archetypal analysis is employed on the compressed features for acoustic modeling, yielding a convex-sparse representation. This representation efficiently captures class-specific discriminative information.
Background & Summary
The goal of acoustic scene classification is to assign a test recording to one of a set of predefined classes, enabling machines to make sense of their environment through the analysis of sound; this is the central objective of machine listening. In this study, we approach the problem with a different workflow: modern convolutional neural network (CNN) models are used to extract features, and conventional dictionary learning procedures are used to learn "atoms", a sparse representation of the audio data, which in turn can be used to build a better performing classifier.
The work presented here is inspired by Thakur et al.1, who identified bird species via a multi-layer alternating sparse-dense framework. A large body of work has emerged in acoustic scene classification and machine listening, but the majority of it is based on fixed features such as spectrogram representations. In this work, we apply a more flexible, data-driven feature extractor to represent and classify acoustic scene data.
Convolutional neural networks can be considered one of the many revolutionary ideas that emerged with Industry 4.0. CNNs are like ordinary deep neural networks, except that they can transform an input image volume into an output volume of class scores with far fewer pre-processing steps than hand-engineered feature extractors. CNNs were originally designed for image processing tasks, but developments in 1-D and 3-D convolutional layers extended ConvNets to domains such as audio processing and medical image diagnostics (3-D MRI, CT, etc.).
As mentioned previously, CNNs can be applied to audio and speech processing tasks thanks to the growth of the 1-D convolution concept. One of the major works applying ConvNets to acoustic scene analysis is "SoundNet: Learning sound representations from unlabeled video" by Yusuf Aytar, Carl Vondrick, and Antonio Torralba2. In the SoundNet architecture, a student-teacher training procedure transfers discriminative visual knowledge from well established visual recognition models into the sound modality, using unlabeled video as a bridge. In this study, we use SoundNet-derived features to obtain audio representations, which lets us gather those representations in a much more flexible, data-driven manner.
Methods
In the proposed methodology, we use a pre-trained SoundNet model to extract audio representations. SoundNet is an eight-layer convolutional neural network trained with the student-teacher procedure described above, distilling well established visual recognition results. In this study, we used the features obtained from the fifth convolutional layer (conv5) of the network. The fifth layer was chosen for representational convenience and based on empirical results.
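For concreteness, a minimal sketch of this feature-extraction step is given below. It assumes a PyTorch port of SoundNet exposing a submodule named conv5; the checkpoint file name and the loading call are placeholders, not part of the original SoundNet release.

```python
import torch

# Assumption: a PyTorch port of the 8-layer SoundNet saved as a full
# model object with a submodule named "conv5"; both are hypothetical.
soundnet = torch.load("soundnet8.pt")
soundnet.eval()

features = {}

def save_conv5(module, inputs, output):
    # Detach so the stored activations do not keep the autograd graph alive.
    features["conv5"] = output.detach()

soundnet.conv5.register_forward_hook(save_conv5)

# Raw audio as a (batch, channels, samples) tensor: 30 s at 44.1 kHz.
waveform = torch.randn(1, 1, 30 * 44100)
with torch.no_grad():
    soundnet(waveform)

conv5_feats = features["conv5"]  # (batch, channels, time) feature map
```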
The features obtained from the fifth layer of the SoundNet architecture were then subjected to class-wise concatenation, which produced a very high-dimensional output. Processing this high-dimensional array was computationally expensive. To tackle this, we employed random-projection-based dimensionality reduction, following the Johnson-Lindenstrauss lemma. For computational convenience, we reduced the dimensionality by projecting the original input space with a sparse random matrix. Sparse random matrices are an alternative to dense Gaussian random projection matrices that guarantee similar embedding quality while being much more memory efficient and allowing faster computation of the projected data.
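A minimal sketch of this step using scikit-learn's SparseRandomProjection follows; the data shape, target distortion eps, and random seed are illustrative assumptions rather than values from this study.

```python
import numpy as np
from sklearn.random_projection import (
    SparseRandomProjection,
    johnson_lindenstrauss_min_dim,
)

# X: class-wise concatenated conv5 features, one row per segment.
# The shape here is illustrative only.
X = np.random.randn(2000, 18432)

# Minimum target dimension that preserves pairwise distances within
# a factor (1 +/- eps), per the Johnson-Lindenstrauss lemma.
k = johnson_lindenstrauss_min_dim(n_samples=X.shape[0], eps=0.25)

# Project with a sparse random matrix instead of a dense Gaussian one.
srp = SparseRandomProjection(n_components=k, random_state=0)
X_low = srp.fit_transform(X)
print(X_low.shape)
```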
A detailed illustration of how the SoundNet-derived features embed the content of the input audio is provided in Figures 1 and 2. For representational convenience, the plots shown are from "Urban" class audios in the "Making Sense of Sounds" challenge; this dataset was chosen for the interpretability of its individual audio files. For the actual experiments, we used the much more challenging and noisy LITIS Rouen audio scene dataset.
After concatenation and dimensionality reduction, we employed convex-hull-based methods to learn dictionaries from the corresponding input data classes. Specifically, we used archetypal-analysis-based methods to learn a dictionary for each class in our dataset. Archetypal analysis aims to represent the observations in a multivariate data set as convex combinations of extremal points. An archetype is the original pattern or model of which all things of the same type are representations or copies; archetypal analysis seeks these "pure types", the archetypes, within a set defined in a specific context. Concretely, the problem is to find a few, not necessarily observed, points (archetypes) in a set of multivariate observations such that all the data can be well represented as convex combinations of those archetypes. In this work, we used the methodology presented by Chen et al.3.
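The methodology of Chen et al.3 is implemented in the SPAMS toolbox; a minimal sketch of per-class dictionary learning with it is given below, under the assumption that spams.archetypalAnalysis follows the interface described in the SPAMS documentation. Shapes and the atom count are illustrative.

```python
import numpy as np
import spams  # SPAMS toolbox; ships the solver of Chen et al. [3]

def learn_dictionary(class_features, n_atoms=64):
    """Learn one archetypal dictionary per class and stack them.

    class_features: dict mapping class label -> (dim, n_samples) array
    whose columns are projected conv5 feature vectors of that class.
    """
    blocks = []
    for label in sorted(class_features):
        # SPAMS expects Fortran-ordered float64 with samples as columns.
        X = np.asfortranarray(class_features[label], dtype=np.float64)
        Z = spams.archetypalAnalysis(X, p=n_atoms)  # (dim, n_atoms) archetypes
        blocks.append(np.asarray(Z))
    # Final dictionary D: all class-specific dictionaries side by side.
    return np.hstack(blocks)
```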
Using archetypal analysis, an individual dictionary was learned for each class, and the final dictionary was obtained by combining all of the individual class-based dictionaries. Archetypal analysis can be seen as a non-negative factorization technique of the form X = DA, where X is the input feature matrix, D is the learned dictionary, and A holds the activations, i.e., the convex-sparse representations. The activation for any input feature vector is obtained by projecting the feature vector onto the simplex corresponding to the dictionary D. This activation representation carries strong class-specific signatures that can be used as the feature representation for our task. In this work, we use the active-set algorithm specified in the same paper3 to obtain these class-specific activations; representational diagrams for the same are attached in Figure 3.
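A sketch of this activation step is shown below, again assuming the SPAMS interface; spams.decompSimplex solves the simplex-constrained least-squares problem for every column of X, which is exactly the projection onto the simplex of D described above.

```python
import numpy as np
import spams

def simplex_activations(X, D):
    """Activations A in X ~ D A, with each column of A on the simplex.

    For every column x of X this solves
        min_a ||x - D a||^2  s.t.  a >= 0, sum(a) = 1,
    i.e., the projection of x onto the simplex spanned by D.
    """
    X = np.asfortranarray(X, dtype=np.float64)
    D = np.asfortranarray(D, dtype=np.float64)
    A = spams.decompSimplex(X, D)  # sparse (n_atoms_total, n_samples)
    return A.toarray()
```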
In this work, we used the activations described above as features to represent the input audio data. A linear SVM with tuned hyperparameters was chosen for classification.
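A minimal sketch of the classification stage with scikit-learn is given below; the hyperparameter grid, the data shapes, and the 19-class label set (matching the LITIS Rouen dataset) are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import LinearSVC

# Placeholders: activations (n_samples, n_atoms_total) and segment labels.
# In the actual pipeline these come from the simplex projection above.
A = np.abs(np.random.randn(1000, 64 * 19))
y = np.random.randint(0, 19, size=1000)

X_train, X_test, y_train, y_test = train_test_split(
    A, y, test_size=0.2, stratify=y, random_state=0
)

# Exhaustive search over the regularization strength of a linear-kernel SVM.
grid = GridSearchCV(
    LinearSVC(),  # one-vs-rest by default
    param_grid={"C": [0.01, 0.1, 1.0, 10.0, 100.0]},
    cv=5,
)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.score(X_test, y_test))
```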
Results
In this work, a subset with an equal number of audio files per class from the LITIS Rouen audio scene dataset was used. Each audio file was 30 s long and sampled at 44.1 kHz. The conv5 layer of the SoundNet architecture was used to obtain the feature representations, and class-specific features were then concatenated. After concatenation, the features were subjected to sparse random projection for dimensionality reduction. Archetypal analysis was then employed to learn class-specific dictionaries with 64 atoms per dictionary; the value of 64 was chosen empirically. The main dictionary was formed by joining the individual class-specific dictionaries together. The activation for any input feature vector was obtained by projecting the feature vector onto the simplex corresponding to the dictionary D. The activations obtained from this projection were labeled segment by segment and passed to the classifier. For classification, a support vector machine with a linear kernel was used, and its hyperparameters were tuned via an exhaustive search over a set of parameters.
The classification model achieved a test-set accuracy of 95.12%, an average precision of 97%, an average recall of 95%, and an average F1-score of 96%. A one-vs-rest strategy was used during training. The confusion matrix of the classification task is attached as Figure 4.
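These metrics can be computed with scikit-learn's standard utilities; the short sketch below continues from the classification sketch above (grid, X_test, y_test).

```python
from sklearn.metrics import classification_report, confusion_matrix

y_pred = grid.predict(X_test)

# Per-class and averaged precision, recall, and F1, as reported.
print(classification_report(y_test, y_pred))

# Confusion matrix, as plotted in Figure 4.
print(confusion_matrix(y_test, y_pred))
```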
References
1. Thakur, A., Abrol, V., Sharma, P. & Rajan, P. Local compressed convex spectral embedding for bird species identification. [Link] (2018).
2. Aytar, Y., Vondrick, C. & Torralba, A. SoundNet: Learning sound representations from unlabeled video. In Advances in Neural Information Processing Systems (2016).
3. Chen, Y., Mairal, J. & Harchaoui, Z. Fast and robust archetypal analysis for representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2014).
Figures
In this section, representational figures and plots are included.
Figure 1. Plot of a sample audio file from the Urban class in the MSoS data challenge. The x axis represents time and the y axis represents amplitude.

Figure 2. Feature representation of the audio file shown in Figure 1. The plot shows the conv5 feature representation from the SoundNet architecture. We can see that the features are activated only in portions where audio is present; the x axis can be visualized as corresponding to the temporal axis.

Figure 3. Activations obtained via projection of a feature vector onto the simplex of the dictionary. Here we visualize the activations of the first class of feature vectors. We learned 64 atoms per class and, as the plot shows, only the initial portion of the main dictionary D is activated, corresponding to the 64 atoms of the first class. The x axis represents the coefficient index and the y axis the corresponding amplitude.

Figure 4. Confusion matrix obtained as a result of the classification.
To-Do
Even though the work and study is almost completed, we would like to add a few more extra points to the current study,
• To get a quantitative measure on how good the Soundnet architecture, on a standalone stage can perform on our present
task.
• Instead of concatenating the feature vectors, to use clustering methods and to see how well the results vary
• To add more metrics to the evaluation of the classifier.
• To see how the performance is varied if we change the parameters like number of dictionaries learned and the amount of
reduction of dimensions of the feature vector.