Spectrogram Transformers for Audio Classification
Yixiao Zhang, Baihua Li*, Hui Fang, Qinggang Meng
Department of Computer Science
Loughborough University
Loughborough, U.K.
{Y.Zhang8, [Link], [Link], [Link]}@[Link]
*Corresponding author

Abstract—Audio classification is an important task in the machine learning field with a wide range of applications. Over the last decade, deep learning based methods have been widely used, and transformer-based models are becoming the new paradigm for audio classification. In this paper, we present Spectrogram Transformers, a group of transformer-based models for audio classification. Based on the fundamental semantics of the audio spectrogram, we design two mechanisms to extract temporal and frequency features from the spectrogram, named time-dimension sampling and frequency-dimension sampling. These discriminative representations are then enhanced by various combinations of attention block architectures, including Temporal Only (TO) attention, Temporal-Frequency Sequential (TFS) attention, Temporal-Frequency Parallel (TFP) attention, and Two-stream Temporal-Frequency (TSTF) attention, to extract the sound record signatures that serve the classification task. Our experiments demonstrate that these Transformer models outperform the state-of-the-art methods on the ESC-50 dataset without a pre-training stage. Furthermore, our method also shows great efficiency compared with other leading methods.

Keywords—Transformer, Spectrogram, Audio representation, Audio classification

I. INTRODUCTION

Sound is a crucial signifier which contains rich high-level semantic environmental information. Consequently, computerised audio classification, which aims to recognise various sound patterns, has been developed for decades [1]. It remains one of the most important tasks in machine learning, driven by a wide range of real-world applications, including surveillance [2], monitoring [3], intelligent fault diagnosis of machines for safety [4], and animal detection for nature reserve protection [5].

Deep learning based models are the most popular methods used for audio classification [6–9]. Many studies use CNN architectures to adapt pre-trained models on audio representations, e.g., spectrograms [10–12] and Mel-Frequency Cepstral Coefficients (MFCC) [13–15], to significantly improve the performance of audio classification, tagging, and recognition tasks [16, 17]. Recently, transformer-based models [9, 18, 19] have been emerging to further boost the performance. Transformer models are able to model long feature dependencies and support parallel processing. Given their success on natural language processing tasks, they have great potential to model time series signals, i.e. audio, in our work.

In this paper, to further design an effective and efficient transformer network architecture, we propose the Spectrogram Transformers, a novel family of Transformer models, for the audio classification task. Specifically, we introduce two sampling methods to extract features from the audio spectrogram: time-dimension sampling and frequency-dimension sampling. These features are further used in four variants of architectures named the Temporal Only (TO) Transformer, Temporal-Frequency Sequential (TFS) Transformer, Temporal-Frequency Parallel (TFP) Transformer, and Two-stream Temporal-Frequency (TSTF) Transformer. We test our models and compare them to the state-of-the-art methods on the ESC-50 dataset, a standard dataset for the audio classification task.

The contributions of this paper can be summarised as follows. First, current transformer networks are based on the ViT architecture [20], which uses patches cropped from the spectrogram as input; we are the first to investigate different sampling and embedding generalisation methods, since our feature partitions are more meaningful from a domain knowledge perspective. Secondly, our Transformer framework achieves an 11.89% accuracy improvement over the state-of-the-art method AST [9] on the ESC-50 dataset without pre-training. Thirdly, our architectures gain large margins in model efficiency.

The rest of our paper is organized as follows: Section II introduces related work for the audio classification task. Section III describes our novel Spectrogram Transformer architectures. Section IV presents our experimental results compared to other methods. Section V draws the conclusion and discusses our future work.

II. RELATED WORK

Time-frequency representations are the most common mid-level features used for audio processing. A majority of studies project raw audio signals into the time-frequency space before the analysis [21–23]. Among these features, spectrograms [10] and Mel-Frequency Cepstral Coefficient (MFCC) features [13] are the most representative, since their 2D form allows recent DNN methods to explore the interactions between the temporal and frequency dimensions.

CNN-based methods have been the primary choice for analysing audio signals during this decade. Many audio classification models deploy standard image classification models, e.g., Inception, ResNet, and VGG [6, 17], on the time-frequency features. To further improve the performance, several new architectures were designed to enhance audio feature extraction and modelling. AclNet [24] presented a VGG architecture with drastically reduced memory to reach a trade-off between accuracy and complexity. SeCoST [25] introduced audio segment level predictions for its classification, while ERANNs [26] proposed efficient residual audio neural networks for audio pattern recognition.

Recently, attention mechanisms and transformer models have been introduced to improve audio classification with global contextual feature awareness. An attention-augmented convolutional neural network [7] was proposed to enhance the audio features by exploring the relationship between various frequency bands, while [27] discussed utilising temporal attention derived from energy changes over time to improve the representations. For the audio classification task, one advantage of transformers over CNN-based methods is that they support variable input lengths, since the length of the input sequence does not affect the number of parameters in a multi-head self-attention or transformer block. When the length of the input audio changes, a transformer-based method can still capture useful global context information promptly. AST [9] is the first transformer model for audio classification; it uses the architecture of the image classification network ViT [20] and adapts the pre-training weights from ViT. PaSST [19] is another leading method which significantly reduces the computation and memory complexity of training transformers for the audio domain.

The distinctive merits of our model compared to other transformer models are: (i) we propose a new sampling strategy to extract attentions from audio spectrograms; and (ii) we design a temporal multi-head self-attention module and a frequency multi-head self-attention module to further investigate effective architectures which integrate these two attentions for better performance.

III. THE SPECTROGRAM TRANSFORMERS

In this section, we explain the details of our proposed Spectrogram Transformers. We firstly present our overall system pipeline, followed by the two sampling mechanisms used to extract the features for our attention blocks. Then, we introduce the four variants of the transformer architectures and their design logic.

Fig. 1. Pipeline of the proposed method. Firstly, the audio spectrogram is extracted from the audio waveform. Then, the time-dimension embedding and the frequency-dimension embedding are generated using the proposed sampling methods. Finally, the proposed transformers are used to give the classification prediction.

A. Spectrogram Transformer Framework

The processing pipeline of our system is depicted in Fig. 1. When audio waveform segments are input into the system, they are converted to spectrogram images. Compared to the original audio waveform, which is a 1D signal, this conversion could potentially boost DNN performance by exploring the interactions between temporal and frequency features. In our work, we generate 128-dimensional log Mel filter-bank energy features. Specifically, the target sampling rate of the audio waveform is 16,000 samples per second. We use 400 as the length of the Fast Fourier Transform (FFT) window and 160 as the stride step for the sampling, so that 100 time frames are generated for each second of audio. The input feature of our network is Z ∈ R^{128×100t}, where t is the length of the input audio in seconds. Subsequently, we use the time-dimension sampling method and the frequency-dimension sampling method (detailed in the following subsection) to generate time-dimension embeddings E_t ∈ R^{100t×768} and frequency-dimension embeddings E_f ∈ R^{128×768}. Similar to ViT [20], we append a learnable classification (CLS) token CLS ∈ R^{1×768} to the embeddings for the classification task. Since the transformer has no access to sequential information, we also add learnable positional embeddings E_pt ∈ R^{(100t+1)×768} to the time-dimension embeddings or E_pf ∈ R^{129×768} to the frequency-dimension embeddings. Finally, we input the sequence E_pt ∈ R^{(100t+1)×768} or E_pf ∈ R^{129×768} to the transformer blocks for the classification task.
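For concreteness, the feature extraction described above can be sketched in a few lines of PyTorch. This is a minimal illustration rather than the authors' released code: it assumes torchaudio's MelSpectrogram transform, the input file name is a placeholder, and the exact frame count may differ by a frame or two from 100t depending on padding.

```python
import torch
import torchaudio

# Settings from Section III-A.
SAMPLE_RATE = 16000
N_FFT = 400       # FFT window length
HOP_LENGTH = 160  # stride step -> roughly 100 frames per second
N_MELS = 128      # log Mel filter-bank channels

def extract_log_mel(waveform: torch.Tensor, sr: int) -> torch.Tensor:
    """Convert a mono waveform into a 128 x ~100t log Mel spectrogram."""
    if sr != SAMPLE_RATE:
        waveform = torchaudio.functional.resample(waveform, sr, SAMPLE_RATE)
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=SAMPLE_RATE, n_fft=N_FFT,
        hop_length=HOP_LENGTH, n_mels=N_MELS)(waveform)
    # Log compression of the filter-bank energies.
    return torch.log(mel + 1e-6).squeeze(0)   # shape: (128, ~100t)

# Example: a 5-second clip gives Z with roughly 128 x 500 entries.
wav, sr = torchaudio.load("example.wav")      # hypothetical file path
Z = extract_log_mel(wav.mean(dim=0, keepdim=True), sr)
print(Z.shape)
```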
B. Time-dimension sampling and frequency-dimension sampling

As illustrated in Fig. 2, we propose two sampling methods to obtain both temporal and frequency features from spectrograms for our transformer models, named time-dimension sampling and frequency-dimension sampling. Different from a normal 2D image, which has two spatial dimensions, a spectrogram presents a time dimension and a frequency dimension, so we believe it is more meaningful to sample them separately when exploring the contextual attentions in our transformer models. Thus, we design the time-dimension sampling method, which generates the embedding E_t ∈ R^{100t×768} based on the vectors z_n^t ∈ R^{1×128}, n = 1...100t, that correspond to the sliding FFT windows at each timestamp n. For the frequency-dimension sampling method, we produce the frequency-dimension embedding E_f ∈ R^{128×768} from z_m^f ∈ R^{100t×1}, m = 1...128.
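A minimal sketch of the two sampling routes, written against the shapes given above. It is an illustration under our own naming (the module and its fields are not taken from the paper's code), and it fixes the clip length to 5 s (500 frames) so that the frequency-route projection has a static input size.

```python
import torch
import torch.nn as nn

class SpectrogramSampling(nn.Module):
    """Turn a log Mel spectrogram Z (128 x 500) into the two token sequences."""

    def __init__(self, n_mels: int = 128, n_frames: int = 500, dim: int = 768):
        super().__init__()
        self.time_proj = nn.Linear(n_mels, dim)    # one token per FFT window
        self.freq_proj = nn.Linear(n_frames, dim)  # one token per Mel bin
        self.cls_t = nn.Parameter(torch.zeros(1, 1, dim))
        self.cls_f = nn.Parameter(torch.zeros(1, 1, dim))
        # Learnable positional embeddings; Section IV notes that the TFS/TFP
        # variants use none on the frequency route.
        self.pos_t = nn.Parameter(torch.zeros(1, n_frames + 1, dim))
        self.pos_f = nn.Parameter(torch.zeros(1, n_mels + 1, dim))

    def forward(self, z: torch.Tensor):
        # z: (batch, 128, 500)
        b = z.size(0)
        e_t = self.time_proj(z.transpose(1, 2))        # (b, 500, 768)
        e_f = self.freq_proj(z)                        # (b, 128, 768)
        e_t = torch.cat([self.cls_t.expand(b, -1, -1), e_t], dim=1) + self.pos_t
        e_f = torch.cat([self.cls_f.expand(b, -1, -1), e_f], dim=1) + self.pos_f
        return e_t, e_f                                # (b, 501, 768), (b, 129, 768)

tokens_t, tokens_f = SpectrogramSampling()(torch.randn(2, 128, 500))
```

How the frequency tokens are consumed differs per variant (Section III-C); this sketch only covers the shared tokenisation step.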
Fig. 2. The sampling methods used for the Spectrogram Transformer. For a t-second audio segment, a 128 × t-dimensional spectrogram is generated, where 128 refers to the frequency bins and t refers to the number of sliding FFT windows over time. (a) Time-dimension sampling: by cutting through the time dimension, t time-dimension embeddings with shape 128 × 1 are generated. (b) Frequency-dimension sampling: by cutting through the frequency dimension, 128 frequency-dimension embeddings with shape 1 × t are generated.

C. Transformer architectures

In our system, we investigate a set of combinations of temporal and frequency attentions to improve the system performance. As illustrated in Fig. 3, we design four transformer-based architectures for audio classification. We start from the simplest model, the Temporal Only (TO) Transformer, which uses a temporal multi-head self-attention. Then, we add a frequency multi-head self-attention to the TO model, sequentially and in parallel respectively, which gives the Temporal-Frequency Sequential (TFS) Transformer and the Temporal-Frequency Parallel (TFP) Transformer. Finally, we have the Two-stream Temporal-Frequency (TSTF) Transformer, which fuses the attentions before feeding them into an MLP classification head. We use 12-head self-attention modules and stack six layers of transformer blocks in each transformer encoder.

1) Model 1: Temporal Only (TO) Transformer: Since audio is time series data, the sequential information of the features is extremely important. Thus, we design the "TO" Transformer for audio spectrograms to capture the attentions between features in the temporal space. Simply put, our "TO" Transformer applies the original multi-head self-attention [28] along the time dimension.

2) Model 2: Temporal-Frequency Sequential (TFS) Transformer: Inspired by Wu et al. [7], the attention from frequency bands also contributes to improving the classification accuracy. In this architecture, we explore using both the frequency attention and the temporal attention to further enhance the temporal feature. To diversify the frequency features for the attention computation, we propose a frequency multi-head self-attention in our transformer block, which expands the original frequency features from 128D to 768D via a linear projection layer. Since the temporal features are more reliable than the frequency features, we place the temporal multi-head self-attention module before the frequency multi-head self-attention module in this architecture.

3) Model 3: Temporal-Frequency Parallel (TFP) Transformer: We make another design to explore the ensemble effect of using both the temporal attention and the frequency attention to enhance the time-dimension embeddings. In this architecture, we use the temporal multi-head self-attention and the frequency multi-head self-attention in parallel with a residual connection, as illustrated in Fig. 3.

4) Model 4: Two-stream Temporal-Frequency (TSTF) Transformer: The fourth model architecture is a two-stream structure. There are two differences between the TSTF Transformer and the previous TFS and TFP Transformers: (i) we design two individual pipelines to enhance both the time-dimension embedding and the frequency embedding via self-attention modules before integrating the two features via an MLP head; and (ii) whereas the TFS and TFP Transformers fuse the temporal and frequency information in each transformer block, the two-stream Transformer integrates them only at the last step.
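To make the block designs more concrete, the sketch below shows one possible TFS-style block and a two-stream fusion head in PyTorch. The wiring details (pre-norm residuals, treating the frequency attention as cross-attention from time tokens to frequency tokens, and fusing the two CLS tokens by concatenation) are our own assumptions for illustration and are not taken from the paper's implementation.

```python
import torch
import torch.nn as nn

class TFSBlock(nn.Module):
    """One assumed Temporal-Frequency Sequential block: temporal MSA, then frequency MSA."""

    def __init__(self, dim: int = 768, heads: int = 12):
        super().__init__()
        self.t_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.f_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x_t, x_f):
        # Temporal self-attention over the (CLS + 500) time tokens.
        h = self.norm1(x_t)
        x_t = x_t + self.t_attn(h, h, h, need_weights=False)[0]
        # Frequency attention: time tokens attend to the 768-D frequency tokens.
        h = self.norm2(x_t)
        x_t = x_t + self.f_attn(h, x_f, x_f, need_weights=False)[0]
        return x_t + self.mlp(self.norm3(x_t))

class TSTFHead(nn.Module):
    """Assumed two-stream fusion: concatenate the two CLS tokens, then an MLP classifier."""

    def __init__(self, dim: int = 768, n_classes: int = 50):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU(),
                                 nn.Linear(dim, n_classes))

    def forward(self, x_t, x_f):
        return self.mlp(torch.cat([x_t[:, 0], x_f[:, 0]], dim=-1))
```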
IV. EMPIRICAL EVALUATION

A. Experimental Setup

Dataset. We evaluate the performance of our proposed models on the audio classification dataset ESC-50 [29]. ESC-50 is a single-label dataset which consists of 2000 environmental audio recordings, each 5 seconds long. There are 50 semantic classes loosely arranged into 5 major categories: Animals; Natural soundscapes & water sounds; Human, non-speech sounds; Interior/domestic sounds; and Exterior/urban noises. The dataset is balanced, with 40 examples per class. We use 5-fold cross-validation for the experiments, where the folds are prearranged in the dataset [29].

Training details. The spectrogram features are extracted from the audio recordings before training the transformer models. For each 5-second audio clip, we extract 128-dimensional log Mel filterbank energy features Z ∈ R^{128×500} using the settings described in Section III-A. We use SpecAugment [30] for data augmentation; the idea of SpecAugment is to use masks of random length to filter out blocks of frequency channels and time windows. The maximum length of the frequency mask is 24, while the maximum length of the time mask is 96. The time-dimension embedding E_t ∈ R^{500×768} and the frequency-dimension embedding E_f ∈ R^{128×768} are then computed by two linear embedding layers from the masked spectrogram features. Finally, we input the embeddings E_pt ∈ R^{501×768} and E_pf ∈ R^{129×768} into our different model architectures.

Notably, most transformer-based audio classification methods rely strongly on pre-training on ImageNet, e.g., AST [9] and PaSST [19], and transformer-based methods show a huge performance drop when trained from scratch without such pre-training; all of our models are trained without any pre-training.
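The masking step described above maps onto torchaudio's SpecAugment-style transforms. The snippet below is a hedged sketch using the mask widths stated in this section; whether the masks are applied once or several times per clip is an assumption.

```python
import torch
import torchaudio

# Mask widths from the training details: frequency <= 24 bins, time <= 96 frames.
freq_mask = torchaudio.transforms.FrequencyMasking(freq_mask_param=24)
time_mask = torchaudio.transforms.TimeMasking(time_mask_param=96)

def augment(spec: torch.Tensor) -> torch.Tensor:
    """Apply one random frequency mask and one random time mask to a (128, 500) spectrogram."""
    return time_mask(freq_mask(spec))
```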
Fig. 3. The 4 Spectrogram Transformer architectures: (a) Temporal Only Transformer, (b) Temporal-Frequency Sequential Transformer, (c) Temporal-Frequency
Parallel Transformer, and (d) Two-stream Temporal-Frequency Transformer. L refers to the number of transformer blocks.
We train our models using a batch size of 28 with the SGD optimizer (momentum 0.9 and weight decay 1e-4) and a cross-entropy loss. Each model is trained for 50 epochs with an initial learning rate of 1e-2, which is decayed to 1e-3 after the 30th epoch.
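As a compact illustration of this training recipe (a sketch only: `model` and the data loader are placeholders, and the scheduler is not specified by the paper beyond the single decay at epoch 30):

```python
import torch
import torch.nn as nn

def train(model: nn.Module, loader, epochs: int = 50, device: str = "cuda"):
    """Cross-entropy training with the hyper-parameters reported above."""
    model.to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2,
                                momentum=0.9, weight_decay=1e-4)
    # Decay the learning rate from 1e-2 to 1e-3 after the 30th epoch.
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                     milestones=[30], gamma=0.1)
    for _ in range(epochs):
        for spec, label in loader:        # batches of 28 spectrograms
            optimizer.zero_grad()
            loss = criterion(model(spec.to(device)), label.to(device))
            loss.backward()
            optimizer.step()
        scheduler.step()
```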
We choose AST [9], the champion of the ESC-50 leaderboard, for comparison. The training code is taken from their GitHub repository without any change, except for slightly reducing the batch size to fit our GPU.

B. Comparison to state-of-the-art methods

We perform an empirical study to understand the performance of our proposed Spectrogram Transformer architectures compared to the state-of-the-art transformer-based method AST [9]. The findings are summarised below.

TABLE II
Performance comparison of Spectrogram Transformer models and AST on the ESC-50 dataset. Top-1 accuracy is presented.

Method                          Top-1 Accuracy (%)
KNN [31]                        32.20
SVM [31]                        39.60
Convolutional Autoencoder [32]  39.90
AST [9]                         45.35
Proposed TSTF Transformer       57.24

Model Accuracy. We firstly conduct experiments to evaluate the model accuracy, shown in Table II. The Temporal Only (TO) Transformer achieves 57.17% top-1 accuracy on ESC-50 without pre-training, which outperforms the state-of-the-art method AST by 11.82%. The Temporal-Frequency Sequential (TFS) Transformer performs slightly better than AST, at 48.09%, whereas the other temporal-frequency model, the Temporal-Frequency Parallel (TFP) Transformer, shows a higher accuracy at 52.88%.
The fourth architecture, the Two-stream Temporal-Frequency (TSTF) Transformer, is the best variant, achieving 57.24% top-1 accuracy and outperforming the state-of-the-art method AST by 11.89%.

C. Ablation studies

From Table II, we find that the Temporal-Frequency Sequential (TFS) Transformer and the Temporal-Frequency Parallel (TFP) Transformer, which study frequency information inside the transformer block rather than in an individual stream, do not perform as well as the Two-stream Temporal-Frequency (TSTF) Transformer. This may be caused by the loss of positional information along the frequency axis: in the TFS and TFP Transformers, unlike the temporal dimension, there is no positional embedding for the frequency dimension, so the model struggles to capture the sequential information in the frequency dimension.

TABLE III
Comparison of our models with AST in cost. MSA refers to multi-head self-attention.

Model     Layers   No. of MSA   GFLOPs   Params
AST [9]   12       12           49.40    86.86 M
TO        6        6            21.64    43.05 M
TFS       6        12           27.12    85.56 M
TFP       6        12           27.12    85.56 M
TSTF      6        12           23.46    57.21 M

Model efficiency. We observe that our models have a lower inference cost as well as fewer parameters. Table III presents the comparison of model capacity between the AST model and our models.

Our Temporal Only (TO) Transformer has the architecture most similar to the AST model. We use a 6-layer architecture for the Temporal Only (TO) Transformer; as a result, the number of parameters of the TO Transformer is about 50% of that of the AST model. We also benefit from the time-dimension sampling method, which produces fewer embeddings, so the FLOPs of the TO Transformer are only about 43.8% of those of the AST model.

Compared with the TO Transformer, both the TFS Transformer and the TFP Transformer have two multi-head self-attentions in each transformer block instead of one. The FLOPs of these Transformers are slightly increased, while the number of parameters is nearly doubled.

For the Two-stream Temporal-Frequency (TSTF) Transformer, we also use a 6-layer architecture. Because the sequence length of the frequency stream (129) is significantly lower than the sequence length of the temporal stream (501), and the complexity of multi-head self-attention is O(n^2 d + nd^2), the computational cost of the frequency stream is much lower than that of the temporal stream, and the computational cost of the 6-layer TSTF Transformer (12 multi-head self-attentions in total) is much lower than that of a 12-layer Temporal Only (TO) Transformer.
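To see why the short frequency stream is cheap, one can plug the two sequence lengths into the O(n^2 d + nd^2) term for a single attention layer. This is a back-of-the-envelope count only; it ignores the MLP sub-layer and constant factors.

```python
D = 768  # token dimension

def attn_cost(n: int, d: int = D) -> int:
    """Leading-order cost of one multi-head self-attention layer: n^2*d + n*d^2."""
    return n * n * d + n * d * d

temporal = attn_cost(501)   # CLS + 500 time tokens
frequency = attn_cost(129)  # CLS + 128 frequency tokens
print(temporal / 1e6, frequency / 1e6, temporal / frequency)
# ~488.3 vs ~88.9 million operations: the frequency stream costs less than a
# fifth of the temporal stream per layer, which is consistent with the 6-layer
# two-stream TSTF model staying close to the 6-layer TO model in GFLOPs.
```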
V. CONCLUSION & FUTURE WORK

In this work, we proposed Spectrogram Transformers to improve the performance of audio classification. Specifically, we designed two sampling mechanisms, time-dimension sampling and frequency-dimension sampling, and four transformer architectures for the audio classification task. All these architectures achieved state-of-the-art results on the ESC-50 dataset among transformer-based methods, each with its own distinctive advantages in either accuracy or efficiency. The main limitation is that there is still a significant performance gap when compared to transformers with pre-trained models. In future work, we will investigate integrating pre-trained models into our transformer architectures to improve the accuracy.

ACKNOWLEDGEMENT

The authors would like to thank the China Scholarship Council and Loughborough University for supporting this study.

REFERENCES

[1] Y. Rui, A. Gupta, and A. Acero, “Automatically extracting highlights for tv baseball programs,” in Proceedings of the eighth ACM international conference on Multimedia, 2000, pp. 105–115.
[2] N. Almaadeed, M. Asim, S. Al-Maadeed, A. Bouridane, and A. Beghdadi, “Automatic detection and classification of audio events for road surveillance applications,” Sensors, vol. 18, no. 6, p. 1858, 2018.
[3] R. V. Sharan and T. J. Moir, “An overview of applications and advancements in automatic sound recognition,” Neurocomputing, vol. 200, pp. 22–34, 2016.
[4] F. Jia, Y. Lei, L. Guo, J. Lin, and S. Xing, “A neural network constructed by deep learning technique and its application to intelligent fault diagnosis of machines,” Neurocomputing, vol. 272, pp. 619–628, 2018.
[5] D. Stowell, M. Wood, Y. Stylianou, and H. Glotin, “Bird detection in audio: a survey and a challenge,” in 2016 IEEE 26th International Workshop on Machine Learning for Signal Processing (MLSP). IEEE, 2016, pp. 1–6.
[6] S. Hershey, S. Chaudhuri, D. P. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold et al., “Cnn architectures for large-scale audio classification,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 131–135.
[7] Y. Wu, H. Mao, and Z. Yi, “Audio classification using attention-augmented convolutional neural network,” Knowledge-Based Systems, vol. 161, pp. 90–100, 2018.
[8] H. Liang and Y. Ma, “Acoustic scene classification using attention-based convolutional neural network,” DCASE2019 Challenge, Tech. Rep., Jun. 2019. [Online]. Available: [Link]
[9] Y. Gong, Y.-A. Chung, and J. Glass, “Ast: Audio spectrogram transformer,” arXiv preprint arXiv:2104.01778, 2021.
[10] S. Dieleman and B. Schrauwen, “End-to-end learning for music audio,” in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014, pp. 6964–6968.
[11] X. Zhang, Y. Zou, and W. Shi, “Dilated convolution neural network with leakyrelu for environmental sound classification,” in 2017 22nd International Conference on Digital Signal Processing (DSP). IEEE, 2017, pp. 1–5.
[12] Z. Chi, Y. Li, and C. Chen, “Deep convolutional neural network combined with concatenated spectrogram for environmental sound classification,” in 2019 IEEE 7th International Conference on Computer Science and Network Technology (ICCSNT). IEEE, 2019, pp. 251–254.
[13] A. Mesaros, T. Heittola, and T. Virtanen, “Tut database for acoustic scene classification and sound event detection,” in 2016 24th European Signal Processing Conference (EUSIPCO). IEEE, 2016, pp. 1128–1132.
[14] Z. Mushtaq and S.-F. Su, “Environmental sound classification using a regularized deep convolutional neural network with data augmentation,” Applied Acoustics, vol. 167, p. 107389, 2020.
[15] D. M. Agrawal, H. B. Sailor, M. H. Soni, and H. A. Patil, “Novel teo-based gammatone features for environmental sound classification,” in 2017 25th European Signal Processing Conference (EUSIPCO). IEEE, 2017, pp. 1809–1813.
[16] A. Guzhov, F. Raue, J. Hees, and A. Dengel, “Esresnet: Environmental sound classification based on visual domain models,” in 2020 25th International Conference on Pattern Recognition (ICPR). IEEE, 2021, pp. 4933–4940.
[17] K. Palanisamy, D. Singhania, and A. Yao, “Rethinking cnn models for
audio classification,” arXiv preprint arXiv:2007.11154, 2020.
[18] L. Pepino, P. Riera, and L. Ferrer, “Study of positional encod-
ing approaches for audio spectrogram transformers,” arXiv preprint
arXiv:2110.06999, 2021.
[19] K. Koutini, J. Schlüter, H. Eghbal-zadeh, and G. Widmer, “Effi-
cient training of audio transformers with patchout,” arXiv preprint
arXiv:2110.05069, 2021.
[20] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai,
T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly,
J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Trans-
formers for image recognition at scale,” in International Conference on
Learning Representations, 2021.
[21] A. Rakotomamonjy and G. Gasso, “Histogram of gradients of time–
frequency representations for audio scene classification,” IEEE/ACM
Transactions on Audio, Speech, and Language Processing, vol. 23, no. 1,
pp. 142–153, 2014.
[22] F. Lieb and H.-G. Stark, “Audio inpainting: Evaluation of time-frequency
representations and structured sparsity approaches,” Signal Processing,
vol. 153, pp. 291–299, 2018.
[23] M. Huzaifah, “Comparison of time-frequency representations for en-
vironmental sound classification using convolutional neural networks,”
arXiv preprint arXiv:1706.07156, 2017.
[24] J. J. Huang and J. J. A. Leanos, “Aclnet: efficient end-to-end audio
classification cnn,” arXiv preprint arXiv:1811.06669, 2018.
[25] A. Kumar and V. K. Ithapu, “Secost:: Sequential co-supervision for large
scale weakly labeled audio event detection,” in ICASSP 2020-2020 IEEE
International Conference on Acoustics, Speech and Signal Processing
(ICASSP). IEEE, 2020, pp. 666–670.
[26] S. Verbitskiy, V. Berikov, and V. Vyshegorodtsev, “Eranns: Efficient
residual audio neural networks for audio pattern recognition,” arXiv
preprint arXiv:2106.01621, 2021.
[27] X. Li, V. Chebiyyam, and K. Kirchhoff, “Multi-stream network
with temporal attention for environmental sound classification,” arXiv
preprint arXiv:1901.08608, 2019.
[28] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez,
Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances
in neural information processing systems, 2017, pp. 5998–6008.
[29] K. J. Piczak, “ESC: Dataset for Environmental Sound Classification,”
in Proceedings of the 23rd Annual ACM Conference on
Multimedia. ACM Press, pp. 1015–1018. [Online]. Available:
[Link]
[30] D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk,
and Q. V. Le, “Specaugment: A simple data augmentation method for
automatic speech recognition,” arXiv preprint arXiv:1904.08779, 2019.
[31] K. J. Piczak, “Esc: Dataset for environmental sound classification,” in
Proceedings of the 23rd ACM international conference on Multimedia,
2015, pp. 1015–1018.
[32] Y. Aytar, C. Vondrick, and A. Torralba, “Soundnet: Learning sound
representations from unlabeled video,” Advances in neural information
processing systems, vol. 29, 2016.