ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) | DOI: 10.1109/ICASSP48485.2024.10446560

A ROBUST AUDIO DEEPFAKE DETECTION SYSTEM VIA MULTI-VIEW FEATURE

Yujie Yang†, Haochen Qin†, Hang Zhou, Chengcheng Wang, Tianyu Guo, Kai Han*, Yunhe Wang*

Huawei Noah’s Ark Lab

†Equal contribution
*Corresponding author

ABSTRACT

With the advancement of generative modeling techniques, synthetic human speech is becoming increasingly indistinguishable from real speech, posing difficult challenges for audio deepfake detection (ADD) systems. In this paper, we exploit audio features to improve the generalizability of ADD systems. We investigate ADD performance over a broad range of audio features, including various handcrafted features and learning-based features. Experiments show that learning-based audio features pretrained on large amounts of data generalize better than hand-crafted features in out-of-domain scenarios. We then further improve the generalizability of the ADD system with the proposed multi-feature approaches, which incorporate complementary information from features of different views. The model trained on ASVspoof 2019 data achieves an equal error rate of 24.27% on the In-the-Wild dataset. The code will be released soon.¹

¹MindSpore: https://gitee.com/mindspore/models

Index Terms— Audio deepfake detection, anti-spoofing, feature incorporation

1. INTRODUCTION

AI technology has made breakthroughs with the support of large-scale models, massive datasets, and powerful computing capabilities. Speech synthesis, voice conversion, and speech editing technologies can now generate human speech that is virtually indistinguishable from real human speech. However, progress in these speech generation technologies has also raised potential threats. Synthetic speech can be misused for spreading rumors, executing fraud, and other illicit activities. Therefore, identifying synthetic speech is increasingly important. In response to such challenges, efforts including the automatic speaker verification spoofing and countermeasures (ASVspoof) and audio deepfake detection (ADD) competitions have been held to collect solutions [1, 2, 3].

Many works in ADD focus on finding proper audio features, which can be roughly categorized into hand-crafted and learning-based features. Hand-crafted features, although simple, achieve acceptable performance because they are specially designed to capture audio properties. For instance, the constant-Q transform (CQT) is good at capturing both long-range structure and fine details in audio signals thanks to its frequency-dependent filter window lengths [4]. MFCC and LFCC features match human auditory characteristics and emphasize low-frequency information, which benefits speech detection tasks. In recent years, the application of learning-based audio features to ADD tasks has attracted tremendous attention. Research has explored the use of Whisper [5] features for detecting synthetic speech; the large amount of audio data behind the Whisper ASR system gives these features an edge over handcrafted ones [6]. Similarly, self-supervised learning-based audio features have also proven beneficial for ADD tasks [7]. The success of self-supervised models in various scenarios can be attributed to extensive pre-training data sourced from diverse domains, which enables the models to produce meaningful audio features even in complicated situations. These features help distinguish real from fake speech and perform well on out-of-domain datasets [8].

However, the performance of an ADD model based on a single feature may degrade because spurious speech can be generated by very different audio synthesis systems, and a single feature cannot represent the characteristics of all of them. Based on this observation, we propose to use multiple features, which improve model generalizability by providing information from different aspects. Two methods are proposed, based on feature selection and feature fusion respectively. These approaches better capture the subtle differences between fake and real speech, bolstering the detection system's accuracy in identifying deep forgery samples, especially those generated by unknown synthesis systems.

This work focuses on improving the generalizability of ADD systems, and our contributions are:

1. We investigate a broad range of handcrafted features and learning-based deep features. Experimental results show strong generalizability for learning-based features pretrained on large amounts of data.

2. We propose two multi-view feature incorporation methods to capture the subtlety of the multiple candidate features and further improve the performance and generalizability of the system.

2. AUDIO FEATURES AND MULTI-VIEW FEATURE INCORPORATION

In this section, we first introduce the hand-crafted and learning-based audio features that are investigated in the experiments. Then, the proposed multi-view feature approaches based on feature selection and feature fusion are presented.

2.1. Audio features

2.1.1. Hand-crafted features

Hand-crafted acoustic features have been well investigated in ADD studies. In this paper, we evaluate 5 hand-crafted features: Mel-scaled spectrogram (Mel), Mel frequency cepstral coefficients (MFCC), log-frequency spectrogram (LogSpec), linear frequency cepstral coefficients (LFCC), and constant-Q transform (CQT).
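As a concrete illustration (not the authors' code), the following sketch extracts the five hand-crafted features with librosa. The 25 ms / 10 ms window and hop follow Section 3.2; the coefficient counts, the LFCC approximation via a DCT of the log linear-frequency spectrogram, and the default CQT hop are illustrative assumptions.

```python
# Sketch: extracting the five hand-crafted features with librosa (assumed setup).
import numpy as np
import scipy.fftpack
import librosa

def handcrafted_features(wav, sr=16000, n_fft=400, hop=160):
    """wav: 1-D float array at 16 kHz; 400/160 samples = 25 ms window / 10 ms hop."""
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=n_fft, hop_length=hop)
    mfcc = librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=20, n_fft=n_fft, hop_length=hop)
    logspec = np.log(np.abs(librosa.stft(wav, n_fft=n_fft, hop_length=hop)) ** 2 + 1e-10)
    # LFCC approximated as a DCT of the log linear-frequency power spectrogram
    # (librosa has no built-in LFCC extractor).
    lfcc = scipy.fftpack.dct(logspec, axis=0, norm="ortho")[:20]
    # CQT with librosa defaults (hop 512) to satisfy its multi-octave hop constraint.
    cqt = np.abs(librosa.cqt(wav, sr=sr))
    return {"Mel": mel, "MFCC": mfcc, "LogSpec": logspec, "LFCC": lfcc, "CQT": cqt}
```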
2.1.2. Learning-based features

Learning-based acoustic features are produced by models trained for various audio tasks, and there is already precedent for using them in ADD [9, 6, 7]. In this paper, 9 learning-based audio features proposed for different tasks are extensively investigated and benchmarked for generalization performance on the ADD task.

Learnable acoustic front-ends automatically learn suitable filter banks while optimizing the training objective. We adopt SincNet [10] and LEAF [11] as learnable front-ends for ADD. Besides, we also evaluate a range of deep learning-based audio features, where the use of additional data as well as task-related training approaches can benefit the ADD task. 7 deep learning-based models across various tasks are chosen to generate audio features. For audio neural codec models, we use EnCodec [12] and AudioDec [13], which mainly consist of autoencoder architectures and aim to encode audio compactly. AudioMAE [14] is selected as a representative of pretrained models for universal audio perception. For models pretrained on human speech, we select Wav2Vec2, Hubert [15], and WavLM [16], which share similar network architectures but use different self-supervised losses. For the ASR model, we use the Whisper [5] model trained on a large dataset covering diverse speech scenarios.
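As an illustration of how such deep features can be obtained (a sketch, not the authors' pipeline), frame-level representations can be pulled from a speech pretrained model with the Hugging Face transformers library; the checkpoint name below is only an example.

```python
# Sketch: extracting deep features from a speech pretrained model
# (Hugging Face transformers). The checkpoint name is illustrative.
import torch
from transformers import AutoFeatureExtractor, AutoModel

ckpt = "facebook/wav2vec2-xls-r-300m"   # HuBERT / WavLM checkpoints work the same way
extractor = AutoFeatureExtractor.from_pretrained(ckpt)
model = AutoModel.from_pretrained(ckpt).eval()

def deep_feature(wav_16k):
    # wav_16k: 1-D numpy array at 16 kHz (already trimmed/padded to 4 s)
    inputs = extractor(wav_16k, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return out.last_hidden_state  # (1, frames, dim), handed to the classifier
```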
2.2. Multi-view feature incorporation

Features extracted from different deep models contain unique information, which can further boost ADD model generalizability when incorporated properly. Therefore, we propose two methods based on feature selection and feature fusion respectively.

2.2.1. Feature selection

Identifying the single most effective feature for the ADD task is difficult, especially for test data with unknown distributions. We therefore introduce multiple candidate features to improve the generalization of the ADD system. However, introducing redundant or irrelevant features may hinder the learning process of the classifier. We thus propose a feature selection mechanism that decides, based on sample-specific information, whether to introduce a feature into the decision process, exploiting the information provided by the multiple features while avoiding the negative impact of certain features:

    mi = Sθ(fi)
    Fselect = Concat({fi ⊙ mi}), i ∈ [0, N]        (1)

The proposed feature selection mechanism is shown in Eq. (1), where fi denotes the candidate features, and mi and Fselect are the selection mask and the selected features respectively. Each feature passes through a selection module Sθ before being concatenated and fed into the classifier. This module consists of lightweight self-attentive layers, and its output is a binary mask that determines whether the feature should be used in the decision for this sample. The discrete decision is obtained with the Gumbel-max method, which allows the selection module to be trained end-to-end with the whole system.
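A minimal PyTorch sketch of this selection branch is given below. The sizes of Sθ and the use of the straight-through Gumbel-softmax relaxation (in place of the Gumbel-max decision described above) are assumptions; the paper only specifies that Sθ is built from lightweight self-attentive layers and outputs a binary mask.

```python
# Sketch of the feature-selection branch (Eq. 1). S_theta scores each candidate
# feature and a straight-through Gumbel sample yields a binary keep/drop mask.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureSelector(nn.Module):
    def __init__(self, dim, n_heads=2):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)  # lightweight self-attention
        self.score = nn.Linear(dim, 2)                 # logits for {drop, keep}

    def forward(self, f):                              # f: (B, T, dim), one candidate feature
        h, _ = self.attn(f, f, f)
        logits = self.score(h.mean(dim=1))             # (B, 2), pooled over time
        m = F.gumbel_softmax(logits, tau=1.0, hard=True)[:, 1]  # binary but differentiable
        return f * m.view(-1, 1, 1)                    # f_i ⊙ m_i

class MultiViewSelection(nn.Module):
    def __init__(self, dims):                          # dims: feature dim of each candidate view
        super().__init__()
        self.selectors = nn.ModuleList(FeatureSelector(d) for d in dims)

    def forward(self, feats):                          # feats: list of (B, T, dim_i) tensors,
        kept = [s(f) for s, f in zip(self.selectors, feats)]   # assumed time-aligned (same T)
        return torch.cat(kept, dim=-1)                 # F_select = Concat({f_i ⊙ m_i})
```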
2.2.2. Feature fusion

Feature fusion, on the other hand, can incorporate all the information in the multi-view feature without discarding any view. To smoothly incorporate acoustic representations from different pretrained models, we combine a channel attention mechanism and a Transformer encoder to build the feature fusion module in Eq. (2). The multi-view feature, formed by concatenating the candidates fi along the channel dimension, is first processed by a lightweight channel attention block that fuses at the channel level (each channel represents one deep feature). Then, a Transformer encoder is applied to fuse the features ri along both the time and frequency dimensions. With an element-wise global receptive field, the final fused representation Ffusion is input to the classifier:

    ri = CA(Concat(fi))
    Ffusion = TE(Concat({ri})), i ∈ [0, N]        (2)

where CA denotes channel attention and TE denotes a vanilla Transformer encoder.
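A corresponding PyTorch sketch of the fusion branch is shown below, assuming the candidate features have already been projected to a common shape; the squeeze-and-excitation form of the channel attention block and the encoder sizes are illustrative choices, not the paper's exact architecture.

```python
# Sketch of the fusion branch (Eq. 2): stack candidate features as channels,
# reweight them with channel attention (CA), then fuse with a vanilla
# Transformer encoder (TE). Sizes are illustrative.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, n_views, reduction=2):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(n_views, max(1, n_views // reduction)), nn.ReLU(),
            nn.Linear(max(1, n_views // reduction), n_views), nn.Sigmoid())

    def forward(self, x):                       # x: (B, n_views, T, dim)
        w = self.fc(x.mean(dim=(2, 3)))         # squeeze over time/frequency -> (B, n_views)
        return x * w.view(*w.shape, 1, 1)       # channel-wise reweighting

class MultiViewFusion(nn.Module):
    def __init__(self, n_views, dim, n_layers=2, n_heads=4):
        super().__init__()
        self.ca = ChannelAttention(n_views)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads, batch_first=True)
        self.te = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, feats):                   # list of (B, T, dim), one per view (same shape)
        x = torch.stack(feats, dim=1)           # (B, n_views, T, dim)
        r = self.ca(x)                          # r_i = CA(Concat(f_i))
        b, v, t, d = r.shape
        return self.te(r.reshape(b, v * t, d))  # F_fusion = TE(Concat({r_i}))
```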

3. EXPERIMENTS

3.1. Datasets

We train our models on the train and dev subsets of the ASVspoof 2019 Logical Access (LA) dataset [17], consistent with most related works. To evaluate our systems, we adopt three datasets. The eval subsets of the ASVspoof 2019 and 2021 challenges are used to test performance within similar domains [2]. The spoofed audio of the ASVspoof challenges is generated by 11 TTS and 8 VC algorithms from the VCTK corpus, and the samples of the eval subset are generated by different algorithms than the train subset. To evaluate generalization ability, we also test our systems on the In-the-Wild dataset, which contains 20.8 hours of real audio and 17.2 hours of deepfake audio [18]. The In-the-Wild dataset is collected from the Internet and consists of audio from various realistic scenarios.
3.2. Implementation details

All audio samples are trimmed or padded to 4 s and resampled to 16 kHz for all acoustic features, except for the neural audio codec models EnCodec and AudioDec, which operate at a 24 kHz sample rate. For all handcrafted features, the window length and hop length are set to 25 ms and 10 ms, respectively. For the speech self-supervised models, we employ the Wav2Vec2 XLS-R [19] model pretrained on 128 languages, the Hubert-base model pretrained on LibriSpeech, and the WavLM-Base-Plus model pretrained on the Libri-Light, GigaSpeech, and VoxPopuli datasets. For the neural audio codec models, the continuous encoder outputs, rather than the discrete codes, are used as audio features to prevent information loss. The AudioMAE model used in our experiments is pretrained on AudioSet. The selected Whisper model is the tiny version pretrained on the speech recognition task. Besides, we select the 24 kHz versions of the EnCodec and AudioDec models. The audio features extracted from the above deep models are the outputs of their respective encoders.
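A short sketch of the waveform preparation described above (mono, resample to 16 kHz, trim or zero-pad to 4 s), written with torchaudio as an assumed implementation:

```python
# Sketch: trim/pad every utterance to 4 s and resample to 16 kHz (torchaudio).
import torch
import torchaudio

def preprocess(path, target_sr=16000, seconds=4):
    wav, sr = torchaudio.load(path)              # (channels, samples)
    wav = wav.mean(dim=0, keepdim=True)          # mono
    if sr != target_sr:
        wav = torchaudio.functional.resample(wav, sr, target_sr)
    n = target_sr * seconds
    if wav.shape[1] >= n:
        wav = wav[:, :n]                         # trim
    else:
        wav = torch.nn.functional.pad(wav, (0, n - wav.shape[1]))  # zero-pad
    return wav
```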
For all experiments, we use a ResNet18 classifier and train all systems with a cross-entropy loss. We use the Adam optimizer with a fixed learning rate of 1e-4 and weight decay of 1e-4, and train every system for 100 epochs. The checkpoint with the lowest validation loss is saved for evaluation. All systems are evaluated by equal error rate (EER).
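For reference, EER can be computed from detection scores as the operating point where the false-positive and false-negative rates meet; a common sketch with scikit-learn (the paper does not specify its scoring code):

```python
# Sketch: equal error rate (EER) from detection scores (higher = more bona fide).
import numpy as np
from sklearn.metrics import roc_curve

def compute_eer(labels, scores):
    # labels: 1 for bona fide, 0 for spoof; scores: classifier outputs
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))        # operating point where FPR ≈ FNR
    return (fpr[idx] + fnr[idx]) / 2
```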
4. RESULT AND ANALYSIS

4.1. Single feature

Table 1. Performance of various single audio features on the ADD task, EER (%).

Features  | ASVspoof19 LA eval | ASVspoof21 DF eval | In-the-Wild
Mel       | 7.42  | 20.13 | 50.56
MFCC      | 6.45  | 27.27 | 75.43
LogSpec   | 5.67  | 20.62 | 52.93
LFCC      | 15.35 | 25.67 | 65.45
CQT       | 4.91  | 20.75 | 56.69
LEAF      | 8.54  | 21.54 | 49.70
SincNet   | 6.12  | 20.78 | 56.74
EnCodec   | 10.25 | 24.93 | 39.44
AudioDec  | 10.47 | 26.13 | 43.69
AudioMAE  | 11.07 | 30.47 | 75.40
XLS-R     | 2.07  | 11.78 | 29.19
Hubert    | 6.78  | 14.76 | 27.48
WavLM     | 7.24  | 15.53 | 30.50
Whisper   | 5.59  | 23.28 | 42.73

Table 1 shows the results of our experiments, where we evaluate 14 audio features under the same experimental setup and test our systems on 3 datasets. The classification results for all features on the ASV2019 LA evaluation set are significantly better than those on the ASV2021 DF evaluation set and the In-the-Wild dataset. The results on the ASV2021 DF and In-the-Wild datasets show that an ADD system trained on the ASV2019 dataset generalizes poorly: the ASV2021 DF dataset contains samples from various spoofing systems that use different audio codec processing methods, and the In-the-Wild samples are collected from complex environments outside professional studios, with differing speech content.

In our experiments, the handcrafted features fail to show reliable discrimination ability in realistic scenarios. All systems using handcrafted features reach an EER greater than 50% on the In-the-Wild dataset. The learnable front-ends LEAF and SincNet learn their filter banks during training but still generalize poorly, with EERs of 49.70 and 56.74 respectively.

On the contrary, most deep features show stronger generalizability. The neural audio codecs EnCodec and AudioDec emphasize compression rate and the fidelity of the decoded audio; while underperforming on the ASV2019 LA and ASV2021 DF evaluation sets, these two models reach EERs of 39.44 and 43.69 on the In-the-Wild dataset. The Wav2Vec2 XLS-R model is pretrained on 436K hours of speech in 128 languages, and the system built on it achieves the best EER on the ASV2019 LA and ASV2021 DF datasets. On the In-the-Wild dataset, its EER decreases by 21.37 compared with the best handcrafted feature. The Hubert and WavLM features also perform excellently on the In-the-Wild dataset, where the Hubert feature achieves the best EER among all single-feature detection systems at 27.48.

Of all the deep features, the AudioMAE model, pretrained on audio spectrograms with a masked autoencoder, shows the poorest generalization on the ADD task. Its EER is 75.40 on the In-the-Wild dataset, worse than most of the handcrafted features.

The failure might be attributed to the pretraining data: AudioMAE is pretrained on the AudioSet dataset, which contains far more general audio than human speech, diluting its ability to discriminate between real and fake human speech. The Whisper feature also fails to generalize well, even though it is pretrained on more than 680k hours of speech data. This feature is obtained by weakly supervised training on the ASR task, which focuses more on speech content than on audio signal information.

To visualize the superior generalizability of deep features over hand-crafted features, Fig. 1 shows t-SNE visualizations of CQT features and Hubert features for real and fake speech in the In-the-Wild dataset. Although discriminating between real and fake remains difficult, the Hubert feature space is more separable than the CQT feature space, where the two categories completely overlap.

[Figure 1: two t-SNE panels, (a) CQT features and (b) Hubert features.]
Fig. 1. Visualization of CQT features and Hubert features for real and fake speech in the In-the-Wild dataset.
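A sketch of how such a t-SNE view can be produced from pooled per-utterance embeddings (the pooling, perplexity, and other settings are assumptions):

```python
# Sketch: 2-D t-SNE of per-utterance features (e.g. time-averaged CQT or Hubert
# embeddings) for real vs. fake speech, as in Fig. 1.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(features, labels, title):
    # features: (n_utts, dim) pooled embeddings; labels: 1 = real, 0 = fake
    pts = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(features)
    for lab, name in [(1, "real"), (0, "fake")]:
        sel = labels == lab
        plt.scatter(pts[sel, 0], pts[sel, 1], s=5, label=name)
    plt.legend(); plt.title(title); plt.show()
```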
4.2. Multi-view feature incorporation

Based on the single-feature experiments, the Hubert, XLS-R, and WavLM features, which perform well on the ASV2021 datasets, are chosen as the multi-view feature set to further improve the generalizability of the detection system.

Table 2. Comparison of EER (%) on the In-the-Wild dataset.

Model            | Features              | In-the-Wild
RawNet2 [20]     | waveform              | 36.74
RawNet2 [18]     | waveform              | 33.94
AASIST [20]      | waveform              | 34.81
ResNet34 [20]    | XLS-R                 | 46.35
LCNN [20]        | XLS-R                 | 39.82
Res2Net [20]     | XLS-R                 | 36.62
ResNet18 (ours)  | XLS-R                 | 29.19
ResNet18 (ours)  | Hubert                | 27.48
ResNet18 (ours)  | WavLM                 | 30.50
Selection (ours) | XLS-R, WavLM, Hubert  | 25.98
Fusion (ours)    | XLS-R, WavLM, Hubert  | 24.27

Table 2 shows the results of incorporating these three deep features on the In-the-Wild dataset. Compared with results implemented in this work and results from other studies [20, 18], both proposed approaches significantly improve model generalizability: the EER drops from 27.48 to 24.27 with feature fusion and to 25.98 with feature selection. The effectiveness of feature selection comes from its sample-aware mask mechanism, through which each individual sample can select the most appropriate feature, whereas a single-feature detection system offers no such selection space. The audio characteristics of the individual sample are learned to form the mask, which is also supervised by the detection task; this end-to-end training guarantees the effectiveness. The success of feature fusion, on the other hand, indicates a complementary effect among the three selected deep features. Each value in the fused feature attends to every other value not only across the time and frequency dimensions but also across the feature dimension, so the fused feature is a better representation for the ResNet18 classifier, yielding the best EER on the In-the-Wild dataset.

5. CONCLUSION

In this paper, we study the association between audio features and the generalizability of ADD systems. First, we test and analyze more audio features than prior studies on the ADD task, covering handcrafted features, learnable audio front-ends, audio neural codecs, an audio pretrained model, speech pretrained models, and a speech recognition model, for a total of 14 audio features. Experimental results on the In-the-Wild dataset show that features from speech pretraining models generalize well while handcrafted features generalize poorly. The generalization performance of speech features on the ADD task comes from the large amount of pretraining data as well as the appropriate pretraining task. We further improve the generalization ability of the model with the proposed feature selection and feature fusion methods. The results show that both methods improve generalizability compared with single features.

6. ACKNOWLEDGEMENT

We gratefully acknowledge the support of MindSpore, CANN (Compute Architecture for Neural Networks) and the Ascend AI Processor used in this research.

7. REFERENCES

[1] Massimiliano Todisco, Xin Wang, Ville Vestman, Md Sahidullah, Héctor Delgado, Andreas Nautsch, Junichi Yamagishi, Nicholas Evans, Tomi Kinnunen, and Kong Aik Lee, "ASVspoof 2019: Future horizons in spoofed and fake audio detection," arXiv preprint arXiv:1904.05441, 2019.

[2] Junichi Yamagishi, Xin Wang, Massimiliano Todisco, Md Sahidullah, Jose Patino, Andreas Nautsch, Xuechen Liu, Kong Aik Lee, Tomi Kinnunen, Nicholas Evans, et al., "ASVspoof 2021: Accelerating progress in spoofed and deepfake speech detection," arXiv preprint arXiv:2109.00537, 2021.

[3] Jiangyan Yi, Ruibo Fu, Jianhua Tao, Shuai Nie, Haoxin Ma, Chenglong Wang, Tao Wang, Zhengkun Tian, Ye Bai, Cunhang Fan, et al., "ADD 2022: The first audio deep synthesis detection challenge," in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 9216-9220.

[4] Rohan Kumar Das, Jichen Yang, and Haizhou Li, "Long range acoustic features for spoofed speech detection," in Interspeech, 2019, pp. 1058-1062.

[5] Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever, "Robust speech recognition via large-scale weak supervision," in International Conference on Machine Learning. PMLR, 2023, pp. 28492-28518.

[6] Piotr Kawa, Marcin Plata, Michał Czuba, Piotr Szymański, and Piotr Syga, "Improved deepfake detection using Whisper features," arXiv preprint arXiv:2306.01428, 2023.

[7] Xin Wang and Junichi Yamagishi, "Investigating self-supervised front ends for speech spoofing countermeasures," arXiv preprint arXiv:2111.07725, 2021.

[8] Yuankun Xie, Haonan Cheng, Yutian Wang, and Long Ye, "Learning a self-supervised domain-invariant feature representation for generalized audio deepfake detection," in Proc. INTERSPEECH 2023, 2023, pp. 2808-2812.

[9] Hemlata Tak, Jose Patino, Massimiliano Todisco, Andreas Nautsch, Nicholas Evans, and Anthony Larcher, "End-to-end anti-spoofing with RawNet2," in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 6369-6373.

[10] Mirco Ravanelli and Yoshua Bengio, "Speaker recognition from raw waveform with SincNet," in 2018 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2018, pp. 1021-1028.

[11] Neil Zeghidour, Olivier Teboul, Félix de Chaumont Quitry, and Marco Tagliasacchi, "LEAF: A learnable frontend for audio classification," arXiv preprint arXiv:2101.08596, 2021.

[12] Alexandre Défossez, Jade Copet, Gabriel Synnaeve, and Yossi Adi, "High fidelity neural audio compression," arXiv preprint arXiv:2210.13438, 2022.

[13] Yi-Chiao Wu, Israel D. Gebru, Dejan Marković, and Alexander Richard, "AudioDec: An open-source streaming high-fidelity neural audio codec," in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1-5.

[14] Po-Yao Huang, Hu Xu, Juncheng Li, Alexei Baevski, Michael Auli, Wojciech Galuba, Florian Metze, and Christoph Feichtenhofer, "Masked autoencoders that listen," Advances in Neural Information Processing Systems, vol. 35, pp. 28708-28720, 2022.

[15] Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed, "HuBERT: Self-supervised speech representation learning by masked prediction of hidden units," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3451-3460, 2021.

[16] Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, et al., "WavLM: Large-scale self-supervised pre-training for full stack speech processing," IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505-1518, 2022.

[17] Xin Wang, Junichi Yamagishi, Massimiliano Todisco, Héctor Delgado, Andreas Nautsch, Nicholas Evans, Md Sahidullah, Ville Vestman, Tomi Kinnunen, Kong Aik Lee, et al., "ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech," Computer Speech & Language, vol. 64, pp. 101114, 2020.

[18] Nicolas M. Müller, Pavel Czempin, Franziska Dieckmann, Adam Froghyar, and Konstantin Böttinger, "Does audio deepfake detection generalize?," arXiv preprint arXiv:2203.16263, 2022.

[19] Arun Babu, Changhan Wang, Andros Tjandra, Kushal Lakhotia, Qiantong Xu, Naman Goyal, Kritika Singh, Patrick von Platen, Yatharth Saraf, Juan Pino, et al., "XLS-R: Self-supervised cross-lingual speech representation learning at scale," arXiv preprint arXiv:2111.09296, 2021.

[20] Jiangyan Yi, Chenglong Wang, Jianhua Tao, Xiaohui Zhang, Chu Yuan Zhang, and Yan Zhao, "Audio deepfake detection: A survey," arXiv preprint arXiv:2308.14970, 2023.
