DEEPFAKE AUDIO DETECTION USING YOLOV8 WITH MEL-SPECTROGRAM ANALYSIS: A CROSS-DATASET EVALUATION
DOI: https://doi.org/10.15588/1607-3274-2025-1-14

Keywords: deepfake detection, YOLOv8, mel-spectrogram, generalization capabilities

Abstract
Context. The problem of detecting deepfake audio has become increasingly critical with the rapid advancement of voice synthesis technologies and their potential for misuse. Traditional audio processing methods face significant challenges in distinguishing sophisticated deepfakes, particularly when tested across different types of audio manipulations and datasets. The object of study is the development of a deepfake audio detection model that uses mel-spectrograms as input to computer vision techniques, with a focus on improving cross-dataset generalization.
Objective. The goal of the work is to improve the generalization capabilities of deepfake audio detection models by representing audio as mel-spectrograms and applying computer vision techniques. This is achieved by adapting YOLOv8, a state-of-the-art object detection model, to audio analysis and investigating the effectiveness of different mel-spectrogram representations across diverse datasets.
Method. A novel approach is proposed that uses YOLOv8 for deepfake audio detection through the analysis of two types of mel-spectrograms: traditional representations and concatenated representations formed from SincConv filters. The method transforms audio signals into visual representations that can be processed by computer vision algorithms, enabling the detection of subtle patterns indicative of synthetic speech. The proposed approach includes several key components: BCE loss optimization for binary classification, SGD with momentum (0.937) for efficient training, and comprehensive data augmentation techniques including random flips, translations, and HSV color augmentations. The SincConv filters cover a frequency range from 0 Hz to 8000 Hz, with a step size of approximately 533.33 Hz per filter, providing detailed frequency analysis capabilities. Effectiveness is evaluated using the EER metric across multiple datasets: ASVspoof 2021 LA (25,380 genuine and 121,461 spoofed utterances) for training, and ASVspoof 2021 DF, Fake-or-Real (111,000 real and 87,000 synthetic utterances), In-the-Wild (17.2 hours fake, 20.7 hours real), and WaveFake (117,985 fake files) for testing cross-dataset generalization.
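To make the described pipeline concrete, the Python sketch below shows one plausible implementation: rendering an utterance as a traditional mel-spectrogram image, constructing the frequency grid implied by the stated 0–8000 Hz SincConv range and ≈533.33 Hz step (which corresponds to 15 filters), and fine-tuning an Ultralytics YOLOv8 classifier with the reported hyperparameters. The dataset layout, filter count, and image settings are illustrative assumptions, not the authors' released code.

```python
# Illustrative sketch only: a plausible reconstruction of the described pipeline,
# not the authors' implementation. Dataset paths, n_mels, and the warm-up length
# are assumptions; Ultralytics' classifier ships with its own built-in loss, so
# the BCE loss named in the abstract would require a custom modification.
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt
from ultralytics import YOLO

SR = 16000  # assumed sample rate; its 8 kHz Nyquist limit matches the 0-8000 Hz range

def audio_to_melspec_image(wav_path: str, png_path: str) -> None:
    """Render a traditional mel-spectrogram as an image a vision model can consume."""
    y, _ = librosa.load(wav_path, sr=SR)
    mel = librosa.feature.melspectrogram(y=y, sr=SR, n_mels=128, fmax=8000)
    mel_db = librosa.power_to_db(mel, ref=np.max)
    fig = plt.figure(frameon=False)
    librosa.display.specshow(mel_db, sr=SR, fmax=8000)
    plt.axis("off")
    fig.savefig(png_path, bbox_inches="tight", pad_inches=0)
    plt.close(fig)

# Frequency grid consistent with the stated SincConv step: 8000 Hz / 15 filters
# gives the ~533.33 Hz spacing reported in the abstract (filter count inferred).
band_edges = np.linspace(0.0, 8000.0, num=16)  # 15 bands -> 16 edges

# Fine-tune a YOLOv8 classifier on the rendered spectrogram images.
# Assumed folder layout: melspec_data/{train,val}/{real,fake}/*.png
model = YOLO("yolov8n-cls.pt")
model.train(
    data="melspec_data",
    epochs=50,                  # training budget reported in the abstract
    lr0=0.01,                   # initial learning rate reported in the abstract
    warmup_epochs=3.0,          # warm-up length assumed; abstract only says "warm-up"
    optimizer="SGD",
    momentum=0.937,             # momentum reported in the abstract
    fliplr=0.5, flipud=0.5,     # random flips
    translate=0.1,              # random translations
    hsv_h=0.015, hsv_s=0.7, hsv_v=0.4,  # HSV color augmentations
)
```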
Results. The experiments demonstrate varying effectiveness of the mel-spectrogram representations across datasets. Concatenated mel-spectrograms showed superior performance on diverse, real-world datasets (In-the-Wild: 34.55% EER; Fake-or-Real: 35.3% EER), while simple mel-spectrograms performed better on more homogeneous datasets (ASVspoof DF: 28.99% EER; WaveFake: 34.55% EER). Feature map visualizations reveal that the model's attention patterns differ significantly between input types, with concatenated spectrograms showing more distributed focus across relevant regions for complex datasets. The training process, conducted over 50 epochs with a learning rate of 0.01 and a warm-up strategy, demonstrated stable convergence and consistent performance across multiple runs.
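For reference, EER (equal error rate) is the operating point at which the false-acceptance and false-rejection rates are equal; lower values are better. A minimal sketch of how it can be computed from detection scores, assuming higher scores indicate genuine audio, is:

```python
# Minimal EER computation sketch; score orientation (higher = genuine) is assumed.
import numpy as np
from sklearn.metrics import roc_curve

def compute_eer(genuine_scores: np.ndarray, spoof_scores: np.ndarray) -> float:
    """Equal error rate: the point where false-accept and false-reject rates cross."""
    scores = np.concatenate([genuine_scores, spoof_scores])
    labels = np.concatenate([np.ones(len(genuine_scores)), np.zeros(len(spoof_scores))])
    fpr, tpr, _ = roc_curve(labels, scores, pos_label=1)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))  # threshold where the two rates are closest
    return float((fpr[idx] + fnr[idx]) / 2.0)

# Example with synthetic, partially overlapping score distributions:
# compute_eer(np.random.rand(1000) + 0.5, np.random.rand(1000)) -> roughly 0.25
```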
Conclusions. The experimental results confirm the viability of using YOLOv8 for deepfake audio detection and demonstrate that
the effectiveness of mel-spectrogram representations depends significantly on dataset characteristics. The findings suggest that input
representation should be selected based on the specific properties of the target audio data, with concatenated spectrograms being
more suitable for diverse, real-world scenarios and simple spectrograms for more controlled, homogeneous datasets. The study provides
a foundation for future research in adaptive representation selection and model optimization for deepfake audio detection.