Spectrogram Transformers for Audio Classification
Yixiao Zhang, Baihua Li*, Hui Fang, Qinggang Meng
Department of Computer Science
Loughborough University
Loughborough, U.K.
{Y.Zhang8, [Link], [Link], [Link]}@[Link]
*Corresponding author

Abstract—Audio classification is an important task in the machine learning field with a wide range of applications. Over the last decade, deep learning based methods have been widely used, and transformer-based models are becoming the new paradigm for audio classification. In this paper, we present Spectrogram Transformers, a group of transformer-based models for audio classification. Based on the fundamental semantics of the audio spectrogram, we design two mechanisms to extract temporal and frequency features from the spectrogram, named time-dimension sampling and frequency-dimension sampling. These discriminative representations are then enhanced by various combinations of attention block architectures, including Temporal Only (TO) attention, Temporal-Frequency Sequential (TFS) attention, Temporal-Frequency Parallel (TFP) attention, and Two-stream Temporal-Frequency (TSTF) attention, to extract the sound record signatures that serve the classification task. Our experiments demonstrate that these Transformer models outperform the state-of-the-art methods on the ESC-50 dataset without a pre-training stage. Furthermore, our method also shows great efficiency compared with other leading methods.

Keywords—Transformer, Spectrogram, Audio representation, Audio classification

I. INTRODUCTION

Sound is a crucial signifier which contains rich high-level semantic environmental information. Consequently, computerised audio classification, which aims to recognise various sound patterns, has been developed for decades [1]. It remains one of the most important tasks in machine learning, driven by a wide range of real-world applications, including surveillance [2], monitoring [3], intelligent fault diagnosis of machines for safety [4], and animal detection for nature reserve protection [5].

Deep learning based models are the most popular methods used for audio classification [6–9]. Many studies use CNN architectures to adapt pre-trained models on audio representations, e.g., spectrograms [10–12] and Mel-Frequency Cepstral Coefficients (MFCC) [13–15], to significantly improve the performance of audio classification, tagging, and recognition tasks [16, 17]. Recently, transformer-based models [9, 18, 19] have been emerging to further boost the performance. Transformer models are able to model long feature dependencies and support parallel processing. Given their success on natural language processing tasks, they have great potential to model time series signals, i.e. audio, in our work.

In this paper, to further design an effective and efficient transformer network architecture, we propose the Spectrogram Transformers, a novel family of Transformer models, for the audio classification task. Specifically, we introduce two sampling methods to extract features from the audio spectrogram: time-dimension sampling and frequency-dimension sampling. These features are further used in four variants of architectures named the Temporal Only (TO) Transformer, Temporal-Frequency Sequential (TFS) Transformer, Temporal-Frequency Parallel (TFP) Transformer, and Two-stream Temporal-Frequency (TSTF) Transformer. We test our models and compare them to the state-of-the-art methods on the ESC-50 dataset, a standard dataset for the audio classification task.

The contributions of this paper can be summarised as follows. First, current transformer networks are based on the ViT architecture [20], which uses patches cropped from the spectrogram as input; we are the first to investigate different sampling and embedding generalisation methods, since our feature partitions are more meaningful from a domain knowledge perspective. Secondly, our Transformer framework achieves an 11.89% accuracy improvement over the state-of-the-art method AST [9] on the ESC-50 dataset without pre-training. Thirdly, our architectures gain large margins in model efficiency.

The rest of our paper is organized as follows: Section II introduces related work for the audio classification task. Section III describes our novel Spectrogram Transformer architectures. Section IV presents our experimental results compared to other methods. Section V draws the conclusion and discusses our future work.

II. RELATED WORK

Time-frequency representations are the most common mid-level features used for audio processing. A majority of studies project raw audio signals into the time-frequency space before the analysis [21–23]. Among these features, spectrograms [10] and Mel-Frequency Cepstral Coefficient (MFCC) features [13] are the most representative, since their 2D form allows recent DNN methods to explore the interactions between the temporal and frequency dimensions.

CNN-based methods have been the primary choice for analysing audio signals during this decade. Many audio classification models deploy standard image classification models, e.g., Inception, ResNet, and VGG [6, 17], on the time-frequency features. To further improve the performance, several new architectures were designed to enhance audio feature extraction and modelling. AclNet [24] presented a VGG architecture with drastically reduced memory to reach a trade-off between accuracy and complexity. SeCoST [25] introduced audio segment level predictions for its classification, while ERANNs [26] proposed efficient residual audio neural networks for audio pattern recognition.

Recently, attention mechanisms and transformer models have been introduced to improve audio classification with global contextual feature awareness. An attention-augmented convolutional neural network [7] was proposed to enhance the audio features by exploring the relationship between various frequency bands, while [27] discussed utilising temporal attention derived from energy changes over time to improve the representations. For the audio classification task, one advantage of transformers over CNN-based methods is that they support variable input lengths, since the length of the input sequence does not affect the number of parameters in a multi-head self-attention or transformer block. When the length of the input audio changes, a transformer-based method can still capture useful global context information promptly. AST [9] is the first transformer model for audio classification; it uses the architecture of the image classification network ViT [20] and adapts the pre-training weights from ViT. PaSST [19] is another leading method which significantly reduces the computation and memory complexity of training transformers for the audio domain.

The distinctive merits of our model compared to other transformer models are: (i) we propose a new sampling strategy to extract attentions from audio spectrograms; and (ii) we design a temporal multi-head self-attention module and a frequency multi-head self-attention module to further investigate effective architectures which integrate these two attentions for better performance.

III. THE SPECTROGRAM TRANSFORMERS

In this section, we explain the details of our proposed Spectrogram Transformers. We firstly present our overall system pipeline, followed by the two sampling mechanisms used to extract the features for our attention blocks. Then, we introduce the four variants of the transformer architectures and their design logic.

Fig. 1. Pipeline of the proposed method. Firstly, the audio spectrogram is extracted from the audio waveform. Then, the time-dimension embedding and the frequency-dimension embedding are generated using the proposed sampling methods. Finally, the proposed transformers are used to give the classification prediction.

A. Spectrogram Transformer Framework

The processing pipeline of our system is depicted in Fig. 1. When audio waveform segments are input into the system, they are converted to spectrogram images. Compared to the original audio waveform, which is a 1D signal, this conversion could potentially boost DNN performance by exploring the interactions between temporal and frequency features. In our work, we generate 128-dimensional log Mel filter-bank energy features. Specifically, the target sampling rate of the audio waveform is 16,000 samples per second. We use 400 as the length of the Fast Fourier Transform (FFT) window and 160 as the stride step for the sampling, so that 100 time frames are generated for each second of audio. The input feature of our network is Z ∈ R^{128×100t}, where t is the length of the input audio in seconds. Subsequently, we use the time-dimension sampling method and the frequency-dimension sampling method (detailed in the following subsection) to generate time-dimension embeddings E_t ∈ R^{100t×768} and frequency-dimension embeddings E_f ∈ R^{128×768}. Similar to ViT [20], we append a learnable classification (CLS) token CLS ∈ R^{1×768} to the embeddings for the classification task. Since the transformer has no access to sequential information, we also add learnable positional embeddings E_pt ∈ R^{(100t+1)×768} to the time-dimension embeddings or E_pf ∈ R^{129×768} to the frequency-dimension embeddings. Finally, we input the sequence E_pt ∈ R^{(100t+1)×768} or E_pf ∈ R^{129×768} to the transformer blocks for the classification task.
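For concreteness, the feature extraction described above can be sketched in a few lines of PyTorch. This is a minimal illustration rather than the authors' released code: it assumes torchaudio's MelSpectrogram transform, the input file name is a placeholder, and the exact frame count may differ by a frame or two from 100t depending on padding.

```python
import torch
import torchaudio

# Settings from Section III-A.
SAMPLE_RATE = 16000
N_FFT = 400       # FFT window length
HOP_LENGTH = 160  # stride step -> roughly 100 frames per second
N_MELS = 128      # log Mel filter-bank channels

def extract_log_mel(waveform: torch.Tensor, sr: int) -> torch.Tensor:
    """Convert a mono waveform into a 128 x ~100t log Mel spectrogram."""
    if sr != SAMPLE_RATE:
        waveform = torchaudio.functional.resample(waveform, sr, SAMPLE_RATE)
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=SAMPLE_RATE, n_fft=N_FFT,
        hop_length=HOP_LENGTH, n_mels=N_MELS)(waveform)
    # Log compression of the filter-bank energies.
    return torch.log(mel + 1e-6).squeeze(0)   # shape: (128, ~100t)

# Example: a 5-second clip gives Z with roughly 128 x 500 entries.
wav, sr = torchaudio.load("example.wav")      # hypothetical file path
Z = extract_log_mel(wav.mean(dim=0, keepdim=True), sr)
print(Z.shape)
```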
B. Time-dimension sampling and frequency-dimension sampling

As illustrated in Fig. 2, we propose two sampling methods to obtain both temporal and frequency features from spectrograms for our transformer models, named time-dimension sampling and frequency-dimension sampling. Different from a normal 2D image, which has two spatial dimensions, a spectrogram presents a time dimension and a frequency dimension, so we believe it is more meaningful to sample them separately when exploring the contextual attentions in our transformer models. Thus, we design the time-dimension sampling method, which generates the embedding E_t ∈ R^{100t×768} based on the vectors z_n^t ∈ R^{1×128}, n = 1...100t, that correspond to the sliding FFT windows at each timestamp n. For the frequency-dimension sampling method, we produce the frequency-dimension embedding E_f ∈ R^{128×768} from z_m^f ∈ R^{100t×1}, m = 1...128.
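A minimal sketch of the two sampling routes, written against the shapes given above. It is an illustration under our own naming (the module and its fields are not taken from the paper's code), and it fixes the clip length to 5 s (500 frames) so that the frequency-route projection has a static input size.

```python
import torch
import torch.nn as nn

class SpectrogramSampling(nn.Module):
    """Turn a log Mel spectrogram Z (128 x 500) into the two token sequences."""

    def __init__(self, n_mels: int = 128, n_frames: int = 500, dim: int = 768):
        super().__init__()
        self.time_proj = nn.Linear(n_mels, dim)    # one token per FFT window
        self.freq_proj = nn.Linear(n_frames, dim)  # one token per Mel bin
        self.cls_t = nn.Parameter(torch.zeros(1, 1, dim))
        self.cls_f = nn.Parameter(torch.zeros(1, 1, dim))
        # Learnable positional embeddings; Section IV notes that the TFS/TFP
        # variants use none on the frequency route.
        self.pos_t = nn.Parameter(torch.zeros(1, n_frames + 1, dim))
        self.pos_f = nn.Parameter(torch.zeros(1, n_mels + 1, dim))

    def forward(self, z: torch.Tensor):
        # z: (batch, 128, 500)
        b = z.size(0)
        e_t = self.time_proj(z.transpose(1, 2))        # (b, 500, 768)
        e_f = self.freq_proj(z)                        # (b, 128, 768)
        e_t = torch.cat([self.cls_t.expand(b, -1, -1), e_t], dim=1) + self.pos_t
        e_f = torch.cat([self.cls_f.expand(b, -1, -1), e_f], dim=1) + self.pos_f
        return e_t, e_f                                # (b, 501, 768), (b, 129, 768)

tokens_t, tokens_f = SpectrogramSampling()(torch.randn(2, 128, 500))
```

How the frequency tokens are consumed differs per variant (Section III-C); this sketch only covers the shared tokenisation step.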
Fig. 2. The sampling methods used for the Spectrogram Transformer. For a t-second audio segment, a 128 × t-dimensional spectrogram is generated, where 128 refers to the frequency bins and t refers to the number of sliding FFT windows over time. (a) Time-dimension sampling: by cutting through the time dimension, t time-dimension embeddings with shape 128 × 1 are generated. (b) Frequency-dimension sampling: by cutting through the frequency dimension, 128 frequency-dimension embeddings with shape 1 × t are generated.

C. Transformer architectures

In our system, we investigate a set of combinations of temporal and frequency attentions to improve the system performance. As illustrated in Fig. 3, we design four transformer-based architectures for audio classification. We start from the simplest model, the Temporal Only (TO) Transformer, which uses a temporal multi-head self-attention. Then, we add a frequency multi-head self-attention to the TO model, sequentially and in parallel respectively, which gives the Temporal-Frequency Sequential (TFS) Transformer and the Temporal-Frequency Parallel (TFP) Transformer. Finally, we have the Two-stream Temporal-Frequency (TSTF) Transformer, which fuses the attentions before feeding them into an MLP classification head. We use 12-head self-attention modules and stack six layers of transformer blocks in each transformer encoder.

1) Model 1: Temporal Only (TO) Transformer: Since audio is time series data, the sequential information of the features is extremely important. Thus, we design the "TO" Transformer for audio spectrograms to capture the attentions between features in the temporal space. Simply put, our "TO" Transformer applies the original multi-head self-attention [28] along the time dimension.

2) Model 2: Temporal-Frequency Sequential (TFS) Transformer: Inspired by Wu et al. [7], the attention from frequency bands also contributes to improving the classification accuracy. In this architecture, we explore using both the frequency attention and the temporal attention to further enhance the temporal feature. To diversify the frequency features for the attention computation, we propose a frequency multi-head self-attention in our transformer block, which expands the original frequency features from 128D to 768D via a linear projection layer. Since the temporal features are more reliable than the frequency features, we place the temporal multi-head self-attention module before the frequency multi-head self-attention module in this architecture.

3) Model 3: Temporal-Frequency Parallel (TFP) Transformer: We make another design to explore the ensemble effect of using both the temporal attention and the frequency attention to enhance the time-dimension embeddings. In this architecture, we use the temporal multi-head self-attention and the frequency multi-head self-attention in parallel with a residual connection, as illustrated in Fig. 3.

4) Model 4: Two-stream Temporal-Frequency (TSTF) Transformer: The fourth model architecture is a two-stream structure. There are two differences between the TSTF Transformer and the previous TFS and TFP Transformers: (i) we design two individual pipelines to enhance both the time-dimension embedding and the frequency embedding via self-attention modules before integrating the two features via an MLP head; and (ii) whereas the TFS and TFP Transformers fuse the temporal and frequency information in each transformer block, the two-stream Transformer integrates them only at the last step.
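To make the block designs more concrete, the sketch below shows one possible TFS-style block and a two-stream fusion head in PyTorch. The wiring details (pre-norm residuals, treating the frequency attention as cross-attention from time tokens to frequency tokens, and fusing the two CLS tokens by concatenation) are our own assumptions for illustration and are not taken from the paper's implementation.

```python
import torch
import torch.nn as nn

class TFSBlock(nn.Module):
    """One assumed Temporal-Frequency Sequential block: temporal MSA, then frequency MSA."""

    def __init__(self, dim: int = 768, heads: int = 12):
        super().__init__()
        self.t_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.f_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x_t, x_f):
        # Temporal self-attention over the (CLS + 500) time tokens.
        h = self.norm1(x_t)
        x_t = x_t + self.t_attn(h, h, h, need_weights=False)[0]
        # Frequency attention: time tokens attend to the 768-D frequency tokens.
        h = self.norm2(x_t)
        x_t = x_t + self.f_attn(h, x_f, x_f, need_weights=False)[0]
        return x_t + self.mlp(self.norm3(x_t))

class TSTFHead(nn.Module):
    """Assumed two-stream fusion: concatenate the two CLS tokens, then an MLP classifier."""

    def __init__(self, dim: int = 768, n_classes: int = 50):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU(),
                                 nn.Linear(dim, n_classes))

    def forward(self, x_t, x_f):
        return self.mlp(torch.cat([x_t[:, 0], x_f[:, 0]], dim=-1))
```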
IV. EMPIRICAL EVALUATION

A. Experimental Setup

Dataset. We evaluate the performance of our proposed models on the audio classification dataset ESC-50 [29]. ESC-50 is a single-label dataset which consists of 2000 environmental audio recordings, each 5 seconds long. There are 50 semantic classes loosely arranged into 5 major categories: Animals; Natural soundscapes & water sounds; Human, non-speech sounds; Interior/domestic sounds; and Exterior/urban noises. The dataset is balanced, with 40 examples per class. We use 5-fold cross-validation for the experiments, where the folds are prearranged in the dataset [29].

Training details. The spectrogram features are extracted from the audio recordings before training the transformer models. For each 5-second audio clip, we extract 128-dimensional log Mel filterbank energy features Z ∈ R^{128×500} using the settings described in Section III-A. We use SpecAugment [30] for data augmentation; the idea of SpecAugment is to use masks of random length to filter out blocks of frequency channels and time windows. The maximum length of the frequency mask is 24, while the maximum length of the time mask is 96. The time-dimension embedding E_t ∈ R^{500×768} and the frequency-dimension embedding E_f ∈ R^{128×768} are then computed by two linear embedding layers from the masked spectrogram features. Finally, we input the embeddings E_pt ∈ R^{501×768} and E_pf ∈ R^{129×768} into our different model architectures.

Notably, most transformer-based audio classification methods rely strongly on pre-training on ImageNet, e.g., AST [9] and PaSST [19], and transformer-based methods show a huge performance drop when trained from scratch without such pre-training; all of our models are trained without any pre-training.
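The masking step described above maps onto torchaudio's SpecAugment-style transforms. The snippet below is a hedged sketch using the mask widths stated in this section; whether the masks are applied once or several times per clip is an assumption.

```python
import torch
import torchaudio

# Mask widths from the training details: frequency <= 24 bins, time <= 96 frames.
freq_mask = torchaudio.transforms.FrequencyMasking(freq_mask_param=24)
time_mask = torchaudio.transforms.TimeMasking(time_mask_param=96)

def augment(spec: torch.Tensor) -> torch.Tensor:
    """Apply one random frequency mask and one random time mask to a (128, 500) spectrogram."""
    return time_mask(freq_mask(spec))
```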
Fig. 3. The 4 Spectrogram Transformer architectures: (a) Temporal Only Transformer, (b) Temporal-Frequency Sequential Transformer, (c) Temporal-Frequency
Parallel Transformer, and (d) Two-stream Temporal-Frequency Transformer. L refers to the number of transformer blocks.
We train our models using a batch size of 28 with the SGD optimizer (momentum 0.9 and weight decay 1e-4) and a cross-entropy loss. Each model is trained for 50 epochs with an initial learning rate of 1e-2, which is decayed to 1e-3 after the 30th epoch.
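As a compact illustration of this training recipe (a sketch only: `model` and the data loader are placeholders, and the scheduler is not specified by the paper beyond the single decay at epoch 30):

```python
import torch
import torch.nn as nn

def train(model: nn.Module, loader, epochs: int = 50, device: str = "cuda"):
    """Cross-entropy training with the hyper-parameters reported above."""
    model.to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2,
                                momentum=0.9, weight_decay=1e-4)
    # Decay the learning rate from 1e-2 to 1e-3 after the 30th epoch.
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                     milestones=[30], gamma=0.1)
    for _ in range(epochs):
        for spec, label in loader:        # batches of 28 spectrograms
            optimizer.zero_grad()
            loss = criterion(model(spec.to(device)), label.to(device))
            loss.backward()
            optimizer.step()
        scheduler.step()
```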
We choose AST [9], the champion of the ESC-50 leaderboard, for comparison. The training code is taken from their GitHub repository without any change, except for slightly reducing the batch size to fit our GPU.

B. Comparison to state-of-the-art methods

We perform an empirical study to understand the performance of our proposed Spectrogram Transformer architectures compared to the state-of-the-art transformer-based method AST [9]. The findings are summarised below.

TABLE II
Performance comparison of Spectrogram Transformer models and AST on the ESC-50 dataset. Top-1 accuracy is presented.

Method                          Top-1 Accuracy (%)
KNN [31]                        32.20
SVM [31]                        39.60
Convolutional Autoencoder [32]  39.90
AST [9]                         45.35
Proposed TSTF Transformer       57.24

Model Accuracy. We firstly conduct experiments to evaluate the model accuracy, shown in Table II. The Temporal Only (TO) Transformer achieves 57.17% top-1 accuracy on ESC-50 without pre-training, which outperforms the state-of-the-art method AST by 11.82%. The Temporal-Frequency Sequential (TFS) Transformer performs slightly better than AST, at 48.09%, whereas the other temporal-frequency model, the Temporal-Frequency Parallel (TFP) Transformer, shows a higher accuracy at 52.88%.
The fourth architecture, the Two-stream Temporal-Frequency (TSTF) Transformer, is the best variant, achieving 57.24% top-1 accuracy and outperforming the state-of-the-art method AST by 11.89%.

C. Ablation studies

From Table II, we find that the Temporal-Frequency Sequential (TFS) Transformer and the Temporal-Frequency Parallel (TFP) Transformer, which study frequency information inside the transformer block rather than in an individual stream, do not perform as well as the Two-stream Temporal-Frequency (TSTF) Transformer. This may be caused by the loss of positional information along the frequency axis: in the TFS and TFP Transformers, unlike the temporal dimension, there is no positional embedding for the frequency dimension, so the model struggles to capture the sequential information in the frequency dimension.

TABLE III
Comparison of our models with AST in cost. MSA refers to multi-head self-attention.

Model     Layers   No. of MSA   GFLOPs   Params
AST [9]   12       12           49.40    86.86 M
TO        6        6            21.64    43.05 M
TFS       6        12           27.12    85.56 M
TFP       6        12           27.12    85.56 M
TSTF      6        12           23.46    57.21 M

Model efficiency. We observe that our models have a lower inference cost as well as fewer parameters. Table III presents the comparison of model capacity between the AST model and our models.

Our Temporal Only (TO) Transformer has the architecture most similar to the AST model. We use a 6-layer architecture for the Temporal Only (TO) Transformer; as a result, the number of parameters of the TO Transformer is about 50% of that of the AST model. We also benefit from the time-dimension sampling method, which produces fewer embeddings, so the FLOPs of the TO Transformer are only about 43.8% of those of the AST model.

Compared with the TO Transformer, both the TFS Transformer and the TFP Transformer have two multi-head self-attentions in each transformer block instead of one. The FLOPs of these Transformers are slightly increased, while the number of parameters is nearly doubled.

For the Two-stream Temporal-Frequency (TSTF) Transformer, we also use a 6-layer architecture. Because the sequence length of the frequency stream (129) is significantly lower than the sequence length of the temporal stream (501), and the complexity of multi-head self-attention is O(n^2 d + nd^2), the computational cost of the frequency stream is much lower than that of the temporal stream, and the computational cost of the 6-layer TSTF Transformer (12 multi-head self-attentions in total) is much lower than that of a 12-layer Temporal Only (TO) Transformer.
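To see why the short frequency stream is cheap, one can plug the two sequence lengths into the O(n^2 d + nd^2) term for a single attention layer. This is a back-of-the-envelope count only; it ignores the MLP sub-layer and constant factors.

```python
D = 768  # token dimension

def attn_cost(n: int, d: int = D) -> int:
    """Leading-order cost of one multi-head self-attention layer: n^2*d + n*d^2."""
    return n * n * d + n * d * d

temporal = attn_cost(501)   # CLS + 500 time tokens
frequency = attn_cost(129)  # CLS + 128 frequency tokens
print(temporal / 1e6, frequency / 1e6, temporal / frequency)
# ~488.3 vs ~88.9 million operations: the frequency stream costs less than a
# fifth of the temporal stream per layer, which is consistent with the 6-layer
# two-stream TSTF model staying close to the 6-layer TO model in GFLOPs.
```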
V. CONCLUSION & FUTURE WORK

In this work, we proposed Spectrogram Transformers to improve the performance of audio classification. Specifically, we designed two sampling mechanisms, time-dimension sampling and frequency-dimension sampling, and four transformer architectures for the audio classification task. All these architectures achieved state-of-the-art results on the ESC-50 dataset among transformer-based methods, each with its own distinctive advantages in either accuracy or efficiency. The main limitation is that there is still a significant performance gap when compared to transformers with pre-trained models. In future work, we will investigate integrating pre-trained models into our transformer architectures to improve the accuracy.

ACKNOWLEDGEMENT

The authors would like to thank the China Scholarship Council and Loughborough University for supporting this study.

REFERENCES

[1] Y. Rui, A. Gupta, and A. Acero, “Automatically extracting highlights for tv baseball programs,” in Proceedings of the eighth ACM international conference on Multimedia, 2000, pp. 105–115.
[2] N. Almaadeed, M. Asim, S. Al-Maadeed, A. Bouridane, and A. Beghdadi, “Automatic detection and classification of audio events for road surveillance applications,” Sensors, vol. 18, no. 6, p. 1858, 2018.
[3] R. V. Sharan and T. J. Moir, “An overview of applications and advancements in automatic sound recognition,” Neurocomputing, vol. 200, pp. 22–34, 2016.
[4] F. Jia, Y. Lei, L. Guo, J. Lin, and S. Xing, “A neural network constructed by deep learning technique and its application to intelligent fault diagnosis of machines,” Neurocomputing, vol. 272, pp. 619–628, 2018.
[5] D. Stowell, M. Wood, Y. Stylianou, and H. Glotin, “Bird detection in audio: a survey and a challenge,” in 2016 IEEE 26th International Workshop on Machine Learning for Signal Processing (MLSP). IEEE, 2016, pp. 1–6.
[6] S. Hershey, S. Chaudhuri, D. P. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold et al., “Cnn architectures for large-scale audio classification,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 131–135.
[7] Y. Wu, H. Mao, and Z. Yi, “Audio classification using attention-augmented convolutional neural network,” Knowledge-Based Systems, vol. 161, pp. 90–100, 2018.
[8] H. Liang and Y. Ma, “Acoustic scene classification using attention-based convolutional neural network,” DCASE2019 Challenge, Tech. Rep., Jun. 2019. [Online]. Available: [Link]
[9] Y. Gong, Y.-A. Chung, and J. Glass, “Ast: Audio spectrogram transformer,” arXiv preprint arXiv:2104.01778, 2021.
[10] S. Dieleman and B. Schrauwen, “End-to-end learning for music audio,” in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014, pp. 6964–6968.
[11] X. Zhang, Y. Zou, and W. Shi, “Dilated convolution neural network with leakyrelu for environmental sound classification,” in 2017 22nd International Conference on Digital Signal Processing (DSP). IEEE, 2017, pp. 1–5.
[12] Z. Chi, Y. Li, and C. Chen, “Deep convolutional neural network combined with concatenated spectrogram for environmental sound classification,” in 2019 IEEE 7th International Conference on Computer Science and Network Technology (ICCSNT). IEEE, 2019, pp. 251–254.
[13] A. Mesaros, T. Heittola, and T. Virtanen, “Tut database for acoustic scene classification and sound event detection,” in 2016 24th European Signal Processing Conference (EUSIPCO). IEEE, 2016, pp. 1128–1132.
[14] Z. Mushtaq and S.-F. Su, “Environmental sound classification using a regularized deep convolutional neural network with data augmentation,” Applied Acoustics, vol. 167, p. 107389, 2020.
[15] D. M. Agrawal, H. B. Sailor, M. H. Soni, and H. A. Patil, “Novel teo-based gammatone features for environmental sound classification,” in 2017 25th European Signal Processing Conference (EUSIPCO). IEEE, 2017, pp. 1809–1813.
[16] A. Guzhov, F. Raue, J. Hees, and A. Dengel, “Esresnet: Environmental sound classification based on visual domain models,” in 2020 25th International Conference on Pattern Recognition (ICPR). IEEE, 2021, pp. 4933–4940.
[17] K. Palanisamy, D. Singhania, and A. Yao, “Rethinking cnn models for
audio classification,” arXiv preprint arXiv:2007.11154, 2020.
[18] L. Pepino, P. Riera, and L. Ferrer, “Study of positional encod-
ing approaches for audio spectrogram transformers,” arXiv preprint
arXiv:2110.06999, 2021.
[19] K. Koutini, J. Schlüter, H. Eghbal-zadeh, and G. Widmer, “Effi-
cient training of audio transformers with patchout,” arXiv preprint
arXiv:2110.05069, 2021.
[20] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai,
T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly,
J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Trans-
formers for image recognition at scale,” in International Conference on
Learning Representations, 2021.
[21] A. Rakotomamonjy and G. Gasso, “Histogram of gradients of time–
frequency representations for audio scene classification,” IEEE/ACM
Transactions on Audio, Speech, and Language Processing, vol. 23, no. 1,
pp. 142–153, 2014.
[22] F. Lieb and H.-G. Stark, “Audio inpainting: Evaluation of time-frequency
representations and structured sparsity approaches,” Signal Processing,
vol. 153, pp. 291–299, 2018.
[23] M. Huzaifah, “Comparison of time-frequency representations for en-
vironmental sound classification using convolutional neural networks,”
arXiv preprint arXiv:1706.07156, 2017.
[24] J. J. Huang and J. J. A. Leanos, “Aclnet: efficient end-to-end audio
classification cnn,” arXiv preprint arXiv:1811.06669, 2018.
[25] A. Kumar and V. K. Ithapu, “Secost:: Sequential co-supervision for large
scale weakly labeled audio event detection,” in ICASSP 2020-2020 IEEE
International Conference on Acoustics, Speech and Signal Processing
(ICASSP). IEEE, 2020, pp. 666–670.
[26] S. Verbitskiy, V. Berikov, and V. Vyshegorodtsev, “Eranns: Efficient
residual audio neural networks for audio pattern recognition,” arXiv
preprint arXiv:2106.01621, 2021.
[27] X. Li, V. Chebiyyam, and K. Kirchhoff, “Multi-stream network
with temporal attention for environmental sound classification,” arXiv
preprint arXiv:1901.08608, 2019.
[28] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez,
Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances
in neural information processing systems, 2017, pp. 5998–6008.
[29] K. J. Piczak, “ESC: Dataset for Environmental Sound Classification,”
in Proceedings of the 23rd Annual ACM Conference on
Multimedia. ACM Press, pp. 1015–1018. [Online]. Available:
[Link]
[30] D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk,
and Q. V. Le, “Specaugment: A simple data augmentation method for
automatic speech recognition,” arXiv preprint arXiv:1904.08779, 2019.
[31] K. J. Piczak, “Esc: Dataset for environmental sound classification,” in
Proceedings of the 23rd ACM international conference on Multimedia,
2015, pp. 1015–1018.
[32] Y. Aytar, C. Vondrick, and A. Torralba, “Soundnet: Learning sound
representations from unlabeled video,” Advances in neural information
processing systems, vol. 29, 2016.