
Deep Convolutional Neural Network with Mixup

for Environmental Sound Classification

Zhichao Zhang, Shugong Xu*, Shan Cao, and Shunqing Zhang

Shanghai Institute for Advanced Communication and Data Science,


Shanghai University, Shanghai, 200444, China
arXiv:1808.08405v1, 25 Aug 2018

{zhichaozhang, shugong, cshan, shunqing}@[Link]

Abstract. Environmental sound classification (ESC) is an important and
challenging problem. In contrast to speech, sound events have a noise-like
nature and may be produced by a wide variety of sources. In this paper, we
propose a novel deep convolutional neural network for ESC tasks. Our network
architecture uses stacked convolutional and pooling layers to extract high-level
feature representations from spectrogram-like features. Furthermore, we apply
mixup to ESC tasks and explore its impact on classification performance and
feature distribution. Experiments were conducted on the UrbanSound8K, ESC-50 and
ESC-10 datasets. The experimental results demonstrate that our ESC system
achieves state-of-the-art performance (83.7%) on UrbanSound8K and competitive
performance on ESC-50 and ESC-10.

Keywords: Environmental Sound Classification · Convolutional Neural Network · Mixup

1 Introduction

Sound recognition is a front-and-center topic in today's pattern recognition
research, covering a rich variety of fields. Some sound recognition topics have
made remarkable research progress, such as automatic speech recognition
(ASR) [9,10] and music information retrieval (MIR) [4,31]. Environmental sound
classification (ESC) is another important branch of sound recognition and is
widely applied in surveillance [21], home automation [33], scene analysis [3] and
machine hearing [14]. However, unlike speech and music, sound events are more
diverse, span a wide range of frequencies and are often less well defined, which
makes ESC tasks more difficult than ASR and MIR. Hence, ESC still faces critical
design issues in performance and accuracy improvement.
Traditional ASR features such as MFCC, LPC and PLP were applied directly to ESC
in previous works [7,13,16,28]. However, state-of-the-art performance has been
achieved by using more discriminative representations such as Mel filterbank
features [5], Gammatone features [34] and wavelet-based features [8].
* Corresponding author. Shanghai Institute for Advanced Communication and Data
Science, Shanghai University, Shanghai, China (email: shugong@[Link]).
These features were modeled with typical machine learning algorithms such as
SVM [32], GMM [17] and KNN [20] for ESC tasks. However, the performance gain
introduced by these approaches is still unsatisfactory. One main reason is that
traditional classifiers lack feature extraction ability.
Over the past few years, deep neural networks (DNNs) have achieved great success
in ASR and MIR [10,25]. For audio signals, DNNs have the ability to extract
features from raw data or hand-crafted features. Therefore, several DNN-based
ESC systems [12,15] were proposed and performed much better than SVM-based ESC
systems. However, the deep fully-connected architecture of DNNs is not robust to
transformed features [22]. Recent research has found that convolutional neural
networks (CNNs) have a strong ability to explore inherent and hidden patterns
from large amounts of training data. Several attempts to apply CNNs to ESC have
obtained performance boosts by learning spectrogram-like features from
environmental sounds [19,23,35]. However, the existing networks for ESC mostly
use shallow architectures, such as 2 convolutional layers [19,35] or
3 convolutional layers [23]. Obtaining more discriminative and powerful
representations usually requires a deeper model. Therefore, in this paper we
propose an enhanced CNN architecture with a deeper network based on VGG
Net [26]. The main contributions of this paper include:
– We propose a novel CNN network based on VGG Net. We find that simply using
stacked convolutional layers with 3x3 convolution filters is unsatisfactory for
our tasks, so we redesign the CNN architecture of our ESC system. Instead of
3x3 convolution filters, we use 1-D convolution filters to learn local patterns
across frequency and time, respectively, and our method performs better than a
CNN using 3x3 convolution filters with the same network depth.
– Mixup is applied in our ESC system. With mixup, every training sample is
created by mixing two examples randomly selected from the original training
dataset, and the training target is set to the mixing ratio. The effectiveness
of mixup on classification performance and feature distribution is then explored
further.
– Experiments were conducted on the UrbanSound8K, ESC-50 and ESC-10 datasets,
the results of which demonstrate that our ESC system achieves state-of-the-art
performance (83.7%) on UrbanSound8K and competitive performance on ESC-50 and
ESC-10.
The rest of this paper is organized as follows. Recent related work on ESC is
introduced in Section 2. Section 3 provides a detailed introduction to our
methods. Section 4 presents the experimental settings on the ESC-10, ESC-50 and
UrbanSound8K datasets, and Section 5 gives both the experimental results and a
detailed discussion. Finally, Section 6 concludes the paper.

2 Related Work
In this section, we introduce the recent deep learning methods for environmental
sound classification. Piczak [19] proposed to apply CNNs to the log mel
spectrogram, which is calculated for each frame of audio and represents the
squared magnitude of each frequency band. Piczak created a two-channel feature
by using the log mel spectrogram and its delta information as the input of his
CNN model, and obtained a 20.5% improvement over the Random Forest method on the
ESC-50 dataset. Takahashi et al. [27] also used log mel spectrograms with their
delta and delta-delta information as a three-channel input, in a manner similar
to the RGB channels of an image. Dharmesh et al. [1] used the gammatone
spectrogram and a CNN architecture similar to Piczak's [19], and claimed
accuracies of 79.1% and 85.34% on the ESC-50 and UrbanSound8K datasets,
respectively. However, since their results were not reproducible, we contacted
the authors and learned that their results did not follow the official
cross-validation folds, i.e., they used different training and validation data
than most published papers, so the results are not comparable. We therefore do
not compare our results with those from [1].
Some researchers have also proposed to train models directly from raw waveforms.
Dai et al. [6] proposed a deep CNN architecture (up to 34 layers) with 1-D
convolutional layers using 1-D raw data as input, and showed accuracy
competitive with a CNN using log mel spectrogram inputs [19]. Tokozume et
al. [29] proposed an end-to-end network named EnvNet that uses raw data as
input, and reported that EnvNet can extract discriminative features that
complement the log mel features. In [30], they constructed a deeper recognition
network based on EnvNet, referred to as EnvNet-v2, and achieved better
performance.
In addition, some researchers have proposed to use external data for sound
recognition. Mun et al. [18] proposed a DNN-based transfer learning method for
ESC. They first trained a DNN model on merged, web-accessible environmental
sound datasets. Then, they transferred the parameters of the pre-trained model
and adapted the sound recognition system to the target domain task using
additional layers. Aytar et al. [2] proposed to learn rich sound representations
from large amounts of unlabeled sound and video data. They transferred the
knowledge of a pre-trained visual recognition network into a sound recognition
network. Then, they used a linear SVM classifier on the features output by a
hidden layer of the sound recognition network for the target task.

3 Methods
3.1 Convolutional Neural Network
A CNN is a multi-layer neural network that stacks a group of convolutional
layers, pooling layers and a limited number of fully connected layers. In this
section, we propose a novel CNN as our ESC system model, inspired by VGG
Net [26]; its architecture is presented in Table 1. The proposed CNN
architecture is comprised of eight convolutional layers and two fully connected
layers. We first use 2 convolutional layers with large filter kernels as a basic
feature extractor. Then, we learn local patterns across frequency and time using
3x1 and 1x5 convolution filters, respectively. Next, we use small convolution
filters (3x3) to learn joint time-frequency patterns. Batch normalization [11]
is applied to the output of the convolutional layers to speed up training.
Rectified Linear Units (ReLU) are used to model the non-linearity of the output
of each layer. After every two convolutional layers, a pooling layer is used to
reduce the dimensions of the convolutional feature maps; max pooling is chosen
in our network. To reduce the risk of overfitting, dropout is applied after the
first fully connected layer with a probability of 0.5, and L2-regularization is
applied to the weights of each layer with a coefficient of 0.0001. In the output
layer, the softmax function is used as the activation function, which outputs
the probabilities of all classes.

Table 1. Configuration of the proposed CNN. "Out shape" gives the dimensions as
(frequency, time, channels). Batch normalization is applied to each convolutional
layer.

Layer  Ksize   Stride  Num of filters   Out shape
Input  -       -       -                (128, 128, 2)
Conv1  (3, 7)  (1, 1)  32               (128, 128, 32)
Conv2  (3, 5)  (1, 1)  32               (128, 128, 32)
Pool1  (4, 3)  (4, 3)  -                (32, 43, 32)
Conv3  (3, 1)  (1, 1)  64               (32, 43, 64)
Conv4  (3, 1)  (1, 1)  64               (32, 43, 64)
Pool2  (4, 1)  (4, 1)  -                (8, 43, 64)
Conv5  (1, 5)  (1, 1)  128              (8, 43, 128)
Conv6  (1, 5)  (1, 1)  128              (8, 43, 128)
Pool3  (1, 3)  (1, 3)  -                (8, 15, 128)
Conv7  (3, 3)  (1, 1)  256              (8, 15, 256)
Conv8  (3, 3)  (1, 1)  256              (8, 15, 256)
Pool4  (2, 2)  (2, 2)  -                (4, 8, 256)
FC1    -       -       512              (512, )
FC2    -       -       num of classes   (num of classes, )
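
For concreteness, the following is a minimal sketch of the configuration in
Table 1 in Keras (the paper trains its models with the Keras library and a
TensorFlow backend; the tf.keras API, the 'same' padding and the layer name
'fc1' are assumptions chosen so that the output shapes match the table):

```python
# Sketch of the proposed CNN (Table 1); kernel sizes are (frequency, time).
from tensorflow.keras import layers, models, regularizers

def conv_block(x, filters, ksize):
    """Conv2D -> BatchNormalization -> ReLU with L2 weight decay (0.0001)."""
    x = layers.Conv2D(filters, ksize, padding='same',
                      kernel_regularizer=regularizers.l2(1e-4))(x)
    x = layers.BatchNormalization()(x)
    return layers.Activation('relu')(x)

def build_proposed_cnn(num_classes, input_shape=(128, 128, 2)):
    inputs = layers.Input(shape=input_shape)                 # (freq, time, 2)
    x = conv_block(inputs, 32, (3, 7))                       # Conv1: large kernels
    x = conv_block(x, 32, (3, 5))                            # Conv2
    x = layers.MaxPooling2D((4, 3), padding='same')(x)       # Pool1 -> (32, 43, 32)
    x = conv_block(x, 64, (3, 1))                            # Conv3: frequency patterns
    x = conv_block(x, 64, (3, 1))                            # Conv4
    x = layers.MaxPooling2D((4, 1), padding='same')(x)       # Pool2 -> (8, 43, 64)
    x = conv_block(x, 128, (1, 5))                           # Conv5: temporal patterns
    x = conv_block(x, 128, (1, 5))                           # Conv6
    x = layers.MaxPooling2D((1, 3), padding='same')(x)       # Pool3 -> (8, 15, 128)
    x = conv_block(x, 256, (3, 3))                           # Conv7: joint patterns
    x = conv_block(x, 256, (3, 3))                           # Conv8
    x = layers.MaxPooling2D((2, 2), padding='same')(x)       # Pool4 -> (4, 8, 256)
    x = layers.Flatten()(x)
    x = layers.Dense(512, activation='relu', name='fc1',     # FC1
                     kernel_regularizer=regularizers.l2(1e-4))(x)
    x = layers.Dropout(0.5)(x)                               # dropout after FC1
    outputs = layers.Dense(num_classes, activation='softmax',
                           kernel_regularizer=regularizers.l2(1e-4))(x)  # FC2
    return models.Model(inputs, outputs)
```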

3.2 Mixup
Mixup is a simple but effective method for generating training data [36]. Fig. 1
shows the pipeline of mixup. Unlike traditional augmentation approaches, mixup
constructs virtual training samples by mixing pairs of training samples.
Normally, a model is optimized with a mini-batch optimization method, such as
mini-batch SGD, and each mini-batch is selected from the whole original training
data. In mixup, however, each input and label of a mini-batch is generated by
mixing two training samples, which are determined by
x̂ = λxi + (1 − λ)xj ,   ŷ = λyi + (1 − λ)yj   (1)
where xi and xj are two samples randomly selected from the training data, and yi
and yj are their one-hot labels. The mixing factor λ is decided by a
hyper-parameter α, with λ ∼ Beta(α, α). Therefore, mixup extends the training
data distribution by linearly mixing training samples within or across classes,
leading to a linear interpolation of the associated targets. Note that we do not
use mixup in the testing phase.

Fig. 1. Pipeline of mixup. Every training sample is created by mixing two
examples randomly selected from the original training dataset. We use the mixed
sound to train the model, and the training target is the mixing ratio.
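
A minimal sketch of mixup batch generation as in Eq. (1) is given below; pairing
each sample with a randomly permuted copy of the batch and drawing a single λ
per batch are common implementation choices, not details taken from the paper:

```python
# Mixup for one mini-batch: x_hat = lam*x_i + (1-lam)*x_j, y_hat likewise (Eq. 1).
import numpy as np

def mixup_batch(x, y, alpha=0.2):
    """x: (batch, freq, time, channels) segments, y: (batch, num_classes) one-hot.

    Returns mixed inputs and soft targets, with lam ~ Beta(alpha, alpha).
    """
    lam = np.random.beta(alpha, alpha)        # mixing factor lambda
    idx = np.random.permutation(len(x))       # random pairing of samples
    x_mixed = lam * x + (1.0 - lam) * x[idx]
    y_mixed = lam * y + (1.0 - lam) * y[idx]
    return x_mixed, y_mixed
```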

4 Experiments
4.1 Dataset
Three publicly available datasets are used for model training and performance
evaluation of the proposed approach: ESC-10, ESC-50 [20] and UrbanSound8K [24].
Detailed information is shown in Table 2.
The ESC-50 dataset consists of 2000 short environmental recordings divided into
50 classes in 5 major categories: animals, natural soundscapes and water sounds,
human non-speech sounds, interior/domestic sounds, and exterior/urban noises.
All audio samples are 5 seconds long with a 44.1 kHz sampling rate.
The ESC-10 dataset is a subset of 10 classes (400 samples) selected from the
ESC-50 dataset (dog bark, rain, sea waves, baby cry, clock tick, person sneeze,
helicopter, chainsaw, rooster, fire crackling).
The UrbanSound8K dataset is a collection of 8732 short (up to 4 seconds) audio
clips of urban sounds, prearranged into 10 folds. The dataset is divided into 10
classes: air conditioner, car horn, children playing, dog bark, drilling, engine
idling, gun shot, jackhammer, siren, and street music.

4.2 Preprocessing
We use a 44.1 kHz sampling rate for the ESC-10, ESC-50 and UrbanSound8K datasets.
All audio samples are normalized into a range from −1 to 1. In order to avoid
overfitting and to effectively utilize the limited data, we use the Time
Stretch [23] and Pitch Shift [23] deformation methods to generate new audio
samples. We use two spectrogram-like representations, the log mel spectrogram
(Mels) and the gammatone spectrogram (GTs). Both features are extracted from all
recordings with a Hamming window size of 1024, a hop length of 512 and 128
bands. The resulting spectrograms are then converted to a logarithmic scale. In
our experiments, we use a simple energy-based silence-dropping algorithm to
remove silent regions. Finally, the spectrograms are split into segments of 128
frames (approximately 1.5 s) with 50% overlap. The delta information of the
original spectrogram, i.e., its first temporal derivative, is also calculated.
We then use each segment together with its deltas as a two-channel input to the
network.

Table 2. Information of datasets.

Datasets      Classes  Num of samples  Duration
UrbanSound8K  10       8732            9.7 hours
ESC-50        50       2000            2.8 hours
ESC-10        10       400             33 min
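
The following sketch illustrates this preprocessing pipeline; librosa and the
specific deformation rates are assumptions (the paper does not name its
feature-extraction tools), silence dropping is omitted for brevity, and the
gammatone spectrogram (GTs) would require a separate gammatone toolbox:

```python
# Log mel spectrogram + delta segments, plus example Time Stretch / Pitch Shift.
import numpy as np
import librosa

def augment(y, sr):
    """Illustrative deformations; the stretch rate and shift steps are assumptions."""
    return [librosa.effects.time_stretch(y, rate=1.2),
            librosa.effects.pitch_shift(y, sr=sr, n_steps=2)]

def log_mel_segments(path, sr=44100, n_fft=1024, hop=512, n_mels=128, frames=128):
    y, _ = librosa.load(path, sr=sr)
    y = y / (np.max(np.abs(y)) + 1e-9)                     # normalize to [-1, 1]
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft, hop_length=hop,
                                         n_mels=n_mels, window='hamming')
    log_mel = librosa.power_to_db(mel)                     # logarithmic scale
    delta = librosa.feature.delta(log_mel)                 # first temporal derivative
    feat = np.stack([log_mel, delta], axis=-1)             # (128, T, 2)
    # split into 128-frame segments (about 1.5 s) with 50% overlap
    return np.array([feat[:, t:t + frames, :]
                     for t in range(0, feat.shape[1] - frames + 1, frames // 2)])
```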

4.3 Training settings


All models are trained using mini-batch stochastic gradient descent (SGD) with
Nesterov momentum of 0.9. We use a learning rate schedule with an initial
learning rate of 0.1, dividing the learning rate by 10 every 80 epochs for
UrbanSound8K and every 100 epochs for ESC-10 and ESC-50. Every batch consists of
200 samples randomly selected from the training set without repetition. The
models are trained for 200 epochs for UrbanSound8K and 300 epochs for ESC-50 and
ESC-10. We initialize all weights with zero-mean Gaussian noise with a standard
deviation of 0.05. We use cross entropy as the loss function, which is typically
used for multi-class classification tasks.
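
A minimal sketch of this training setup, assuming tf.keras and the model sketch
given after Table 1 (the Gaussian weight initialization would be passed per
layer as kernel_initializer when the model is built):

```python
# SGD with Nesterov momentum, step LR schedule and cross-entropy loss (Sec. 4.3).
from tensorflow.keras import optimizers, callbacks

def lr_schedule(epoch, lr=None):
    """Initial LR 0.1, divided by 10 every 80 epochs (use 100 for ESC-10/ESC-50)."""
    return 0.1 * (0.1 ** (epoch // 80))

def train(model, x_train, y_train, epochs=200):
    """x_train: spectrogram segments; y_train: (possibly mixed-up) one-hot targets."""
    model.compile(
        optimizer=optimizers.SGD(learning_rate=0.1, momentum=0.9, nesterov=True),
        loss='categorical_crossentropy',       # cross entropy, works with soft targets
        metrics=['accuracy'])
    model.fit(x_train, y_train,
              batch_size=200,                  # 200 samples per mini-batch
              epochs=epochs,                   # 200 (UrbanSound8K) or 300 (ESC-10/50)
              callbacks=[callbacks.LearningRateScheduler(lr_schedule)])
    return model
```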
In the test stage, feature extraction and audio cropping patterns are the same
as those used in the training stage. The prediction probability of a test audio
sample is the average of the predicted class probabilities of its segments, and
the predicted label is the class with the highest posterior probability. The
classification performance of the methods is evaluated by K-fold
cross-validation: for the ESC-50 and ESC-10 datasets, K is set to 5, while for
the UrbanSound8K dataset, K is set to 10.
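
A minimal sketch of this test-stage aggregation (names are illustrative):

```python
# Clip-level prediction: average the segment probabilities, then take the argmax.
import numpy as np

def predict_clip(model, segments):
    """segments: (num_segments, 128, 128, 2) array cropped as in training."""
    seg_probs = model.predict(segments)    # (num_segments, num_classes)
    clip_prob = seg_probs.mean(axis=0)     # average over the clip's segments
    return int(np.argmax(clip_prob)), clip_prob
```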
All models are trained using the Keras library with the TensorFlow backend on an
Nvidia P100 GPU with 12 GB of memory.

5 Results and Analysis


The classification accuracy of the proposed method compared with recent related
works is shown in Table 3. It can be observed that our method achieves
state-of-the-art performance (83.7%) on the UrbanSound8K dataset and competitive
performance (91.7%, 83.9%) on ESC-10 and ESC-50.

Table 3. Classification accuracy (%) of different ESC systems. For our ESC
system, we compare two different features, each with and without augmentation
('aug'), where augmentation includes Pitch Shift and Time Stretch. Note that we
do not compare with the results of Dharmesh [1], as discussed in Section 2.

Model                      Feature   ESC-10  ESC-50  UrbanSound8K
PiczakCNN [19]             Mels      80.5    64.9    72.7
D-CNN [37]                 Mels      -       68.1    81.9
SoundNet [2]               -         92.1    74.2    -
EnvNet-v2 [30]             Raw data  91.4    84.9    78.3
proposedCNN                Mels      88.7    76.8    74.7
proposedCNN                GTs       89.2    78.9    77.4
proposedCNN + mixup        Mels      90.2    79.2    77.3
proposedCNN + mixup        GTs       90.7    80.7    79.8
proposedCNN + aug + mixup  Mels      91.3    82.5    82.6
proposedCNN + aug + mixup  GTs       91.7    83.9    83.7
human performance          -         95.7    81.3    -
Dharmesh [1]               GTs       -       79.10   85.34

The average classification accuracy of our methods with Mels outperformed
PiczakCNN [19] (the baseline) by 10.8%, 17.6% and 9.9% on the ESC-10, ESC-50 and
UrbanSound8K datasets, respectively.
Data augmentation is an important technique for increasing performance on
limited datasets; it gave an improvement of 1.1%, 3.3% and 5.3% on ESC-10,
ESC-50 and UrbanSound8K, respectively. In addition, GTs improved over Mels by
0.4%, 1.4% and 1.1% on ESC-10, ESC-50 and UrbanSound8K, respectively. We can see
that the classification accuracy with GTs is always better than that with Mels
on the ESC-10, ESC-50 and UrbanSound8K datasets, which indicates that feature
representation is a critical factor for classification performance. Moreover,
mixup is a powerful way to improve performance and consistently yields better
results than training without mixup. In our experiments, mixup gave an
improvement of 1.5%, 2.4% and 2.6% with Mels on the ESC-10, ESC-50 and
UrbanSound8K datasets, respectively. As mentioned in Section 3, mixup trains a
network on linear combinations of training examples and their labels, which
regularizes the network and improves generalization to unseen data. We explore
the effect of mixup further in the following subsections.

5.1 Comparison of network architecture

We compare our proposed CNN with a VGG network architecture of the same depth.
This VGG network has the same network parameters as our proposed CNN except that
it uses 3x3 convolution filters and 2x2 stride pooling; we refer to this
architecture as VGG10. In Table 4, we report the classification accuracy of
proposedCNN and VGG10 on the ESC-10, ESC-50 and UrbanSound8K datasets. The
results show that our proposed CNN consistently performs better than VGG10 on
all three datasets.

Fig. 2. Training curves of our proposed CNN on (a) ESC-50 and (b) UrbanSound8K
datasets.

Table 4. Comparison between the proposed CNN and VGG10 Net (%).

Model        ESC-10  ESC-50  UrbanSound8K
proposedCNN  88.7    76.8    74.7
VGG10        87.5    73.3    73.2

5.2 Effects of Mixup

Analysis. The confusion matrix obtained by the proposed CNN with Mels and mixup
on the UrbanSound8K dataset is given in Fig. 3(a). We can observe that most
misrecognitions happened between pairs of noise-like classes, such as jackhammer
and drilling, engine idling and jackhammer, and air conditioner and engine
idling. In Fig. 3(b), we show the difference between the confusion matrices of
the proposed CNN with and without mixup. We see that mixup gives an improvement
for most classes, especially for air conditioner, drilling, jackhammer and
siren. However, mixup also has a slightly harmful effect on the accuracy of some
classes and increases confusion between some specific pairs of classes. For
example, although mixup reduces the confusion between jackhammer and engine
idling, it increases the confusion between jackhammer and siren.

Fig. 3. (a) Confusion matrix for the UrbanSound8K dataset using the proposed CNN
model applied to Mels with mixup augmentation. (b) Difference between the
confusion matrices for the UrbanSound8K dataset using the proposed CNN and Mels
with and without mixup: negative values (brown) mean the confusion is decreased
with mixup, positive values (blue) mean the confusion is increased with mixup.
Classes are air conditioner (AI), car horn (CA), children playing (CH), dog
barking (DO), drilling (DR), engine idling (EN), gun shot (GU), jackhammer (JA),
siren (SI) and street music (ST).
To gain further insight into the effect of mixup, we visualized the feature
distributions for UrbanSound8K with and without mixup using PCA in Fig. 4. The
feature dots represent the high-level feature vectors obtained at the output of
the first fully connected layer (FC1). We can observe that the feature
distributions with and without mixup are quite different. Fig. 4(a) shows the
feature distributions of different classes without mixup. Some classes have a
large within-class variance of the feature distribution, while others have a
small within-class variance. In addition, the between-class distances of
different pairs
of classes also vary, which may make models more sensitive to some classes. With
mixup, however, the features of most classes are distributed within a small
space with a relatively smaller within-class variance, and the boundaries
between most classes are clear, as shown in Fig. 4(b).

Fig. 4. Visualization of the feature distribution at the output of FC1 using
PCA: (a) without mixup and (b) with mixup.
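
A minimal sketch of this feature-space analysis, assuming scikit-learn for PCA
and the layer name 'fc1' from the earlier architecture sketch:

```python
# Project FC1 activations of the trained model to 2-D with PCA for visualization.
from sklearn.decomposition import PCA
from tensorflow.keras import models

def fc1_pca(model, x, n_components=2):
    """Return the PCA projection of the FC1 outputs for inputs x."""
    fc1 = models.Model(model.input, model.get_layer('fc1').output)
    feats = fc1.predict(x)                 # high-level feature vectors
    return PCA(n_components=n_components).fit_transform(feats)
```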
Hyper-parameter α selection. In order to achieve better performance for our
system on ESC, the effect of the mixup hyper-parameter α is further explored.
Fig. 5 shows the change in accuracy with α ranging over [0.1, 0.5]. We see that
the best accuracy on all three datasets is achieved when α = 0.2.

Fig. 5. Accuracy curves for different α on ESC-10, ESC-50 and UrbanSound8K.

6 Conclusion
In this paper, we proposed a novel deep convolutional neural network
architecture for environmental sound classification. We compared our proposed
CNN with VGG10, and the results showed that our proposed CNN consistently
performed better. To further improve the classification accuracy, mixup was
applied in our ESC system. As a result, the proposed ESC system achieved
state-of-the-art performance on the UrbanSound8K dataset and competitive
performance on the ESC-10 and ESC-50 datasets. Furthermore, we explored the
impact of mixup on the classification accuracy and the feature space
distribution of different classes on the UrbanSound8K dataset. The results
showed that mixup is a powerful method for improving classification accuracy.
Our future work will focus on network design and on exploring the use of mixup
for specific classes.

References
1. Agrawal, D.M., Sailor, H.B., Soni, M.H., Patil, H.A.: Novel teo-based gammatone
features for environmental sound classification. In: Signal Processing Conference
(EUSIPCO), 2017 25th European. pp. 1809–1813. IEEE (2017)
2. Aytar, Y., Vondrick, C., Torralba, A.: Soundnet: Learning sound representations
from unlabeled video. In: Advances in Neural Information Processing Systems. pp.
892–900 (2016)
3. Barchiesi, D., Giannoulis, D., Dan, S., Plumbley, M.D.: Acoustic scene classifica-
tion: Classifying environments from the sounds they produce. IEEE Signal Pro-
cessing Magazine 32(3), 16–34 (2015)
4. Casey, M.A., Veltkamp, R., Goto, M., Leman, M., Rhodes, C., Slaney, M.: Content-
based music information retrieval: Current directions and future challenges. Pro-
ceedings of the IEEE 96(4), 668–696 (2008)
5. Chu, S., Narayanan, S., Kuo, C.C.J.: Environmental sound recognition with time-
frequency audio features. Institute of Electrical and Electronics Engineers Inc.,
The (2009)
6. Dai, W., Dai, C., Qu, S., Li, J., Das, S.: Very deep convolutional neural networks
for raw waveforms. In: Acoustics, Speech and Signal Processing (ICASSP), 2017
IEEE International Conference on. pp. 421–425. IEEE (2017)
7. Eronen, A.J., Peltonen, V.T., Tuomi, J.T., Klapuri, A.P., Fagerlund, S., Sorsa, T.,
Lorho, G., Huopaniemi, J.: Audio-based context recognition. IEEE Transactions
on Audio, Speech, and Language Processing 14(1), 321–329 (2006)
8. Geiger, J.T., Helwani, K.: Improving event detection for audio surveillance using
gabor filterbank features. In: Signal Processing Conference. pp. 714–718 (2015)
9. Graves, A., Mohamed, A.r., Hinton, G.: Speech recognition with deep recurrent
neural networks. In: Acoustics, speech and signal processing (icassp), 2013 ieee
international conference on. pp. 6645–6649. IEEE (2013)
10. Hinton, G., Deng, L., Yu, D., Dahl, G.E., Mohamed, A.r., Jaitly, N., Senior, A.,
Vanhoucke, V., Nguyen, P., Sainath, T.N., et al.: Deep neural networks for acoustic
modeling in speech recognition: The shared views of four research groups. IEEE
Signal Processing Magazine 29(6), 82–97 (2012)
11. Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by
reducing internal covariate shift pp. 448–456 (2015)
12. Kons, Z., Toledo-Ronen, O.: Audio event classification using deep neural networks.
In: Interspeech. pp. 1482–1486 (2013)
13. Lee, K., Ellis, D.P.: Audio-based semantic concept classification for consumer video.
IEEE Transactions on Audio, Speech, and Language Processing 18(6), 1406–1416
(2010)
14. Lyon, R.F.: Machine hearing: An emerging field [exploratory dsp]. Signal Process-
ing Magazine IEEE 27(5), 131–139 (2010)
15. McLoughlin, I., Zhang, H., Xie, Z., Song, Y., Xiao, W.: Robust sound event classi-
fication using deep neural networks. IEEE/ACM Transactions on Audio, Speech,
and Language Processing 23(3), 540–552 (2015)
16. McLoughlin, I.V.: Line spectral pairs. Signal processing 88(3), 448–467 (2008)
17. Mesaros, A., Heittola, T., Benetos, E., Foster, P., Lagrange, M., Virtanen, T.,
Plumbley, M.D.: Detection and classification of acoustic scenes and events: Out-
come of the dcase 2016 challenge. IEEE/ACM Transactions on Audio, Speech, and
Language Processing 26(2), 379–393 (2018)
18. Mun, S., Shon, S., Kim, W., Han, D.K., Ko, H.: Deep neural network based learn-
ing and transferring mid-level audio features for acoustic scene classification. In:
Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Con-
ference on. pp. 796–800. IEEE (2017)
19. Piczak, K.J.: Environmental sound classification with convolutional neural net-
works. In: IEEE International Workshop on Machine Learning for Signal Process-
ing. pp. 1–6 (2015)
20. Piczak, K.J.: Esc: Dataset for environmental sound classification. In: ACM Inter-
national Conference on Multimedia. pp. 1015–1018 (2015)
21. Radhakrishnan, R., Divakaran, A., Smaragdis, P.: Audio analysis for surveillance
applications. In: Applications of Signal Processing to Audio and Acoustics, 2005.
IEEE Workshop on. pp. 158–161 (2005)
22. Sainath, T.N., Mohamed, A.r., Kingsbury, B., Ramabhadran, B.: Deep convo-
lutional neural networks for lvcsr. In: Acoustics, speech and signal processing
(ICASSP), 2013 IEEE international conference on. pp. 8614–8618. IEEE (2013)
23. Salamon, J., Bello, J.: Deep convolutional neural networks and data augmentation
for environmental sound classification. IEEE Signal Processing Letters PP(99),
1–1 (2016)
24. Salamon, J., Jacoby, C., Bello, J.P.: A dataset and taxonomy for urban sound
research. In: Proceedings of the 22nd ACM international conference on Multimedia.
pp. 1041–1044. ACM (2014)
25. Schedl, M., Gómez, E., Urbano, J., et al.: Music information retrieval: Recent
developments and applications. Foundations and Trends in Information Retrieval
8(2-3), 127–261 (2014)
26. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale
image recognition. Computer Science (2014)
27. Takahashi, N., Gygli, M., Pfister, B., Van Gool, L.: Deep convolutional neu-
ral networks and data augmentation for acoustic event detection. arXiv preprint
arXiv:1604.07160 (2016)
28. Temko, A., Monte, E., Nadeu, C.: Comparison of sequence discriminant support
vector machines for acoustic event classification. In: Acoustics, Speech and Signal
Processing, 2006. ICASSP 2006 Proceedings. 2006 IEEE International Conference
on. vol. 5, pp. V–V. IEEE (2006)
29. Tokozume, Y., Harada, T.: Learning environmental sounds with end-to-end con-
volutional neural network. In: Acoustics, Speech and Signal Processing (ICASSP),
2017 IEEE International Conference on. pp. 2721–2725. IEEE (2017)
30. Tokozume, Y., Ushiku, Y., Harada, T.: Learning from between-class examples for
deep sound recognition. arXiv preprint arXiv:1711.10282 (2018)
31. Typke, R., Wiering, F., Veltkamp, R.C.: A survey of music information retrieval
systems. In: Proc. 6th International Conference on Music Information Retrieval.
pp. 153–160. Queen Mary, University of London (2005)
32. Uzkent, B., Barkana, B.D., Cevikalp, H.: Non-speech environmental sound classi-
fication using svms with a new set of features. International Journal of Innovative
Computing, Information and Control 8(5), 3511–3524 (2012)
33. Vacher, M., Serignat, J.F., Chaillol, S.: Sound classification in a smart room envi-
ronment: an approach using gmm and hmm methods. Sped 1 (2014)
34. Valero, X., Alias, F.: Gammatone cepstral coefficients: Biologically inspired fea-
tures for non-speech audio classification. IEEE Transactions on Multimedia 14(6),
1684–1689 (2012)
35. Zhang, H., Mcloughlin, I., Song, Y.: Robust sound event recognition using convo-
lutional neural networks. In: IEEE International Conference on Acoustics, Speech
and Signal Processing. pp. 559–563 (2015)
36. Zhang, H., Cisse, M., Dauphin, Y.N., Lopez-Paz, D.: mixup: Beyond empirical risk
minimization. arXiv preprint arXiv:1710.09412 (2017)
37. Zhang, X., Zou, Y., Shi, W.: Dilated convolution neural network with leakyrelu
for environmental sound classification. In: Digital Signal Processing (DSP), 2017
22nd International Conference on. pp. 1–5. IEEE (2017)
