A Robust Audio Deepfake Detection System Via Multi-View Feature
Yujie Yang†, Haochen Qin†, Hang Zhou, Chengcheng Wang, Tianyu Guo, Kai Han*, Yunhe Wang*
…features to further improve the performance and generalizability of the system.

2.2. Multi-view feature incorporation

Features extracted from different deep models contain unique information, which can further boost ADD model generalizability with proper feature incorporation methods. Therefore, we propose two methods, based on feature selection and feature fusion respectively.

$$r_i = \mathrm{CA}(\mathrm{Concat}(f_i)), \qquad F_{\mathrm{fusion}} = \mathrm{TE}(\mathrm{Concat}(\{r_i\})), \quad i \in [0, N] \tag{2}$$

where CA denotes channel attention and TE denotes a vanilla Transformer encoder.
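To make Eq. (2) concrete, the following is a minimal PyTorch sketch of the fusion path. It is an illustration under stated assumptions, not the authors' released implementation: the squeeze-and-excitation form of CA, a shared feature dimension across views, and concatenation along the time axis are all assumptions, and the per-view Concat in Eq. (2) is folded into the input preparation.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """CA in Eq. (2); the squeeze-and-excitation variant is an assumption."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time); gate each channel from pooled statistics
        w = self.fc(x.mean(dim=-1))           # (batch, channels)
        return x * w.unsqueeze(-1)            # re-weighted channels

class FeatureFusion(nn.Module):
    """r_i = CA(f_i); F_fusion = TE(Concat({r_i})) as in Eq. (2)."""
    def __init__(self, feat_dim: int, n_views: int = 3, n_heads: int = 4):
        super().__init__()
        self.ca = nn.ModuleList(ChannelAttention(feat_dim) for _ in range(n_views))
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=n_heads,
                                           batch_first=True)
        self.te = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, views: list[torch.Tensor]) -> torch.Tensor:
        # each view f_i: (batch, time, feat_dim), already projected to feat_dim
        r = [ca(f.transpose(1, 2)).transpose(1, 2)   # CA over feature channels
             for ca, f in zip(self.ca, views)]
        fused = torch.cat(r, dim=1)                  # concat (axis is an assumption)
        return self.te(fused)                        # vanilla Transformer encoder
```

With three views such as XLS-R, WavLM, and Hubert embeddings projected to a shared dimension, `FeatureFusion` returns the fused representation that is passed on to the downstream classifier.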
3. EXPERIMENTS

3.1. Datasets

We train our models on the train and dev subsets of the ASVspoof 2019 Logical Access (LA) dataset [17], which is consistent with most related works. To evaluate our systems, we adopt three datasets. The eval subsets of the ASVspoof 2019 and 2021 challenges are used to test performance within similar domains [2]. The spoofed audio of the ASVspoof challenges is generated from the VCTK corpus by 11 TTS and 8 VC algorithms, and the samples of the eval subset are generated with different algorithms from those of the train subset. To evaluate the generalization ability of our systems, we also test on the In-the-Wild dataset, which contains 20.8 hours of real audio and 17.2 hours of deepfake audio [18]. The In-the-Wild dataset is collected from the Internet and consists of audio from various realistic scenarios.

Table 1. Performance (EER, %) of various single audio features on the ADD task.

Features    ASVspoof19 LA eval    ASVspoof21 DF eval    In-the-Wild
Mel              7.42                  20.13                50.56
MFCC             6.45                  27.27                75.43
LogSpec          5.67                  20.62                52.93
LFCC            15.35                  25.67                65.45
CQT              4.91                  20.75                56.69
LEAF             8.54                  21.54                49.70
SincNet          6.12                  20.78                56.74
EnCodec         10.25                  24.93                39.44
AudioDec        10.47                  26.13                43.69
AudioMAE        11.07                  30.47                75.40
XLS-R            2.07                  11.78                29.19
Hubert           6.78                  14.76                27.48
WavLM            7.24                  15.53                30.50
Whisper          5.59                  23.28                42.73

3.2. Implementation details
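All results in this paper are reported as equal error rate (EER). As a reference point, here is a minimal, framework-free sketch of how EER can be computed from detection scores; the function name and the score convention (higher means more likely bona fide) are our assumptions.

```python
import numpy as np

def compute_eer(scores: np.ndarray, labels: np.ndarray) -> float:
    """EER: the operating point where the false acceptance rate (spoof
    accepted) equals the false rejection rate (bona fide rejected).
    scores: detection scores, higher = more likely bona fide.
    labels: 1 for bona fide, 0 for spoof."""
    thresholds = np.sort(np.unique(scores))
    far = np.array([(scores[labels == 0] >= t).mean() for t in thresholds])
    frr = np.array([(scores[labels == 1] < t).mean() for t in thresholds])
    idx = np.argmin(np.abs(far - frr))      # threshold where FAR ~= FRR
    return float((far[idx] + frr[idx]) / 2)
```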
Fig. 1. Visualization of (a) CQT features and (b) Hubert features for real and fake speech in the In-the-Wild dataset.

Table 2. Comparison of EER (%) on the In-the-Wild dataset.

Model              Features                 In-the-Wild
RawNet2 [20]       waveform                 36.74
RawNet2 [18]       waveform                 33.94
AASIST [20]        waveform                 34.81
ResNet34 [20]      XLS-R                    46.35
LCNN [20]          XLS-R                    39.82
Res2Net [20]       XLS-R                    36.62
ResNet18 (ours)    XLS-R                    29.19
ResNet18 (ours)    Hubert                   27.48
ResNet18 (ours)    WavLM                    30.50
Selection (ours)   XLS-R, WavLM, Hubert     25.98
Fusion (ours)      XLS-R, WavLM, Hubert     24.27
…the pretraining dataset. AudioMAE is pretrained on the AudioSet dataset, which contains far more universal audio than human speech, diluting the ability to discriminate between real and fake human speech. The Whisper feature also fails to generalize well, even though it is pretrained with more than 680k hours of speech data. This feature is obtained by weakly supervised training on the ASR task, which focuses more on speech content than on audio signal information.

To visualize the superior generalizability of deep features over hand-crafted features, Fig. 1 shows a t-SNE visualization of CQT features and Hubert features for real and fake speech in the In-the-Wild dataset. Although both feature spaces reveal the difficulty of discriminating real from fake, the two classes are more distinguishable in the Hubert feature space than in the CQT feature space, where the two categories completely overlap.
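For reproduction, a minimal sketch of the kind of t-SNE projection behind Fig. 1, assuming utterance-level embeddings (e.g., mean-pooled Hubert hidden states or flattened CQT frames) are already stored in NumPy arrays; the function name and hyperparameters are illustrative, not the paper's exact settings.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(feats, labels, title):
    # feats: (n_utterances, dim) NumPy array of pooled features
    # labels: (n_utterances,) NumPy array, 1 = real, 0 = fake
    emb = TSNE(n_components=2, perplexity=30, init="pca",
               random_state=0).fit_transform(feats)
    for cls, name in [(1, "real"), (0, "fake")]:
        pts = emb[labels == cls]
        plt.scatter(pts[:, 0], pts[:, 1], s=4, alpha=0.5, label=name)
    plt.legend()
    plt.title(title)
    plt.show()
```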
4.2. Multi-view feature incorporation

Based on the results of the single-view feature experiments, the Hubert, XLS-R, and WavLM features, which perform well on the ASVspoof 2021 datasets, are chosen as the multi-view features to further improve the generalizability of the detection system.

Table 2 shows the results of incorporating these three deep features on the In-the-Wild dataset. Compared both to results implemented in this work and to results from prior studies [20, 18], the two proposed approaches significantly improve model generalizability: the EER drops from 27.48 to 24.27 with feature fusion and to 25.98 with feature selection. The effectiveness of feature selection comes from a sample-aware mask mechanism, through which each individual sample can select the most appropriate feature, whereas a single-feature detection system provides no feature selection space. The audio characteristics of each sample are learned to form the mask, which is also supervised by the detection task; this end-to-end approach guarantees the effectiveness of the selection. On the other hand, the success of feature fusion indicates a complementary effect among the three selected deep features. Each value in the fused feature attends to every other value not only across the time and frequency dimensions but also across the feature dimension, so the fused feature is a better representation for the ResNet18 classifier, which obtains the best EER on the In-the-Wild dataset.
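This excerpt does not spell out the mask architecture, so the following PyTorch sketch shows one plausible reading of a sample-aware selection mask: a gate network pools all views, and a Gumbel-softmax produces a near-discrete, per-sample choice that stays trainable end-to-end under the detection loss. The gating design and the Gumbel-softmax trick are our assumptions, not the authors' exact method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureSelector(nn.Module):
    """Per-sample selection over N candidate feature views."""
    def __init__(self, feat_dim: int, n_views: int = 3):
        super().__init__()
        # Gate network: pooled statistics of all views -> one logit per view.
        self.gate = nn.Linear(feat_dim * n_views, n_views)

    def forward(self, views: list[torch.Tensor]) -> torch.Tensor:
        # each view: (batch, time, feat_dim); equal shapes are assumed here
        pooled = torch.cat([v.mean(dim=1) for v in views], dim=-1)  # (B, N*D)
        # Gumbel-softmax mask: near one-hot per sample yet differentiable,
        # so the selection is supervised end-to-end by the detection loss.
        mask = F.gumbel_softmax(self.gate(pooled), tau=1.0, hard=True)  # (B, N)
        stacked = torch.stack(views, dim=1)                  # (B, N, T, D)
        return (mask[:, :, None, None] * stacked).sum(dim=1)  # selected view
```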
5. CONCLUSION

In this paper, we study the association between audio features and the generalizability of the ADD system. First, more audio features are tested and analyzed than in any other study on the ADD task: handcrafted features, learnable audio front-ends, audio neural codecs, audio pretrained models, speech pretrained models, and speech recognition models, for a total of 14 audio features. Experimental results show that on the In-the-Wild dataset, features from speech pretraining models generalize well while hand-crafted features generalize poorly. The generalization performance of speech features on the ADD task comes from the large amount of pretraining data as well as the appropriate pretraining task. We further improve the generalization ability of the model with the proposed feature selection and feature fusion methods. The results show that both methods improve generalizability compared to single features.

6. ACKNOWLEDGEMENT

We gratefully acknowledge the support of MindSpore, CANN (Compute Architecture for Neural Networks) and the Ascend AI Processor used for this research.

7. REFERENCES

[1] Massimiliano Todisco, Xin Wang, Ville Vestman, Md Sahidullah, Héctor Delgado, Andreas Nautsch, Junichi Yamagishi, Nicholas Evans, Tomi Kinnunen, and Kong Aik Lee, "ASVspoof 2019: Future horizons in spoofed and fake audio detection," arXiv preprint arXiv:1904.05441, 2019.

[2] Junichi Yamagishi, Xin Wang, Massimiliano Todisco, Md Sahidullah, Jose Patino, Andreas Nautsch, Xuechen Liu, Kong Aik Lee, Tomi Kinnunen, Nicholas Evans, et al., "ASVspoof 2021: Accelerating progress in spoofed and deepfake speech detection," arXiv preprint arXiv:2109.00537, 2021.

[3] Jiangyan Yi, Ruibo Fu, Jianhua Tao, Shuai Nie, Haoxin Ma, Chenglong Wang, Tao Wang, Zhengkun Tian, Ye Bai, Cunhang Fan, et al., "ADD 2022: The first audio deep synthesis detection challenge," in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 9216–9220.

[4] Rohan Kumar Das, Jichen Yang, and Haizhou Li, "Long range acoustic features for spoofed speech detection," in Interspeech, 2019, pp. 1058–1062.

[5] Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever, "Robust speech recognition via large-scale weak supervision," in International Conference on Machine Learning. PMLR, 2023, pp. 28492–28518.

[6] Piotr Kawa, Marcin Plata, Michał Czuba, Piotr Szymański, and Piotr Syga, "Improved deepfake detection using Whisper features," arXiv preprint arXiv:2306.01428, 2023.

[7] Xin Wang and Junichi Yamagishi, "Investigating self-supervised front ends for speech spoofing countermeasures," arXiv preprint arXiv:2111.07725, 2021.

[8] Yuankun Xie, Haonan Cheng, Yutian Wang, and Long Ye, "Learning a self-supervised domain-invariant feature representation for generalized audio deepfake detection," in Proc. INTERSPEECH 2023, 2023, pp. 2808–2812.

[9] Hemlata Tak, Jose Patino, Massimiliano Todisco, Andreas Nautsch, Nicholas Evans, and Anthony Larcher, "End-to-end anti-spoofing with RawNet2," in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 6369–6373.

[10] Mirco Ravanelli and Yoshua Bengio, "Speaker recognition from raw waveform with SincNet," in 2018 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2018, pp. 1021–1028.

[11] Neil Zeghidour, Olivier Teboul, Félix de Chaumont Quitry, and Marco Tagliasacchi, "LEAF: A learnable frontend for audio classification," arXiv preprint arXiv:2101.08596, 2021.

[12] Alexandre Défossez, Jade Copet, Gabriel Synnaeve, and Yossi Adi, "High fidelity neural audio compression," arXiv preprint arXiv:2210.13438, 2022.

[13] Yi-Chiao Wu, Israel D. Gebru, Dejan Marković, and Alexander Richard, "AudioDec: An open-source streaming high-fidelity neural audio codec," in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.

[14] Po-Yao Huang, Hu Xu, Juncheng Li, Alexei Baevski, Michael Auli, Wojciech Galuba, Florian Metze, and Christoph Feichtenhofer, "Masked autoencoders that listen," Advances in Neural Information Processing Systems, vol. 35, pp. 28708–28720, 2022.

[15] Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed, "HuBERT: Self-supervised speech representation learning by masked prediction of hidden units," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3451–3460, 2021.

[16] Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, et al., "WavLM: Large-scale self-supervised pre-training for full stack speech processing," IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022.

[17] Xin Wang, Junichi Yamagishi, Massimiliano Todisco, Héctor Delgado, Andreas Nautsch, Nicholas Evans, Md Sahidullah, Ville Vestman, Tomi Kinnunen, Kong Aik Lee, et al., "ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech," Computer Speech & Language, vol. 64, pp. 101114, 2020.

[18] Nicolas M. Müller, Pavel Czempin, Franziska Dieckmann, Adam Froghyar, and Konstantin Böttinger, "Does audio deepfake detection generalize?," arXiv preprint arXiv:2203.16263, 2022.

[19] Arun Babu, Changhan Wang, Andros Tjandra, Kushal Lakhotia, Qiantong Xu, Naman Goyal, Kritika Singh, Patrick von Platen, Yatharth Saraf, Juan Pino, et al., "XLS-R: Self-supervised cross-lingual speech representation learning at scale," arXiv preprint arXiv:2111.09296, 2021.

[20] Jiangyan Yi, Chenglong Wang, Jianhua Tao, Xiaohui Zhang, Chu Yuan Zhang, and Yan Zhao, "Audio deepfake detection: A survey," arXiv preprint arXiv:2308.14970, 2023.