DEEPFAKE AUDIO DETECTION USING YOLOV8 WITH MEL-SPECTROGRAM ANALYSIS: A CROSS-DATASET EVALUATION
DOI: https://doi.org/10.15588/1607-3274-2025-1-14

Keywords: deepfake detection, YOLOv8, mel-spectrogram, generalization capabilities

Abstract
Context. The problem of detecting deepfake audio has become increasingly critical with the rapid advancement of voice synthesis technologies and their potential for misuse. Traditional audio processing methods face significant challenges in distinguishing sophisticated deepfakes, particularly when tested across different types of audio manipulations and datasets. The object of study is the development of a deepfake audio detection model that uses mel-spectrograms as input to computer vision techniques, with a focus on improving cross-dataset generalization.
Objective. The goal of the work is to improve the generalization capabilities of deepfake audio detection models by representing audio as mel-spectrograms and applying computer vision techniques. This is achieved by adapting YOLOv8, a state-of-the-art object detection model, to audio analysis and investigating the effectiveness of different mel-spectrogram representations across diverse datasets.
Method. A novel approach is proposed that uses YOLOv8 for deepfake audio detection through the analysis of two types of mel-spectrograms: traditional representations and concatenated representations formed from SincConv filters. The method transforms audio signals into visual representations that can be processed by computer vision algorithms, enabling the detection of subtle patterns indicative of synthetic speech. The proposed approach includes several key components: BCE loss optimization for binary classification, SGD with momentum (0.937) for efficient training, and comprehensive data augmentation techniques including random flips, translations, and HSV color augmentations. The SincConv filters cover a frequency range from 0 Hz to 8000 Hz, with a step size of approximately 533.33 Hz per filter, providing detailed frequency analysis capabilities. Effectiveness is evaluated using the EER metric across multiple datasets: ASVspoof 2021 LA (25,380 genuine and 121,461 spoofed utterances) for training, and ASVspoof 2021 DF, Fake-or-Real (111,000 real and 87,000 synthetic utterances), In-the-Wild (17.2 hours fake, 20.7 hours real), and WaveFake (117,985 fake files) for testing cross-dataset generalization.
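To make the described pipeline concrete, the Python sketch below shows one plausible implementation: rendering an utterance as a traditional mel-spectrogram image, constructing the frequency grid implied by the stated 0–8000 Hz SincConv range and ≈533.33 Hz step (which corresponds to 15 filters), and fine-tuning an Ultralytics YOLOv8 classifier with the reported hyperparameters. The dataset layout, filter count, and image settings are illustrative assumptions, not the authors' released code.

```python
# Illustrative sketch only: a plausible reconstruction of the described pipeline,
# not the authors' implementation. Dataset paths, n_mels, and the warm-up length
# are assumptions; Ultralytics' classifier ships with its own built-in loss, so
# the BCE loss named in the abstract would require a custom modification.
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt
from ultralytics import YOLO

SR = 16000  # assumed sample rate; its 8 kHz Nyquist limit matches the 0-8000 Hz range

def audio_to_melspec_image(wav_path: str, png_path: str) -> None:
    """Render a traditional mel-spectrogram as an image a vision model can consume."""
    y, _ = librosa.load(wav_path, sr=SR)
    mel = librosa.feature.melspectrogram(y=y, sr=SR, n_mels=128, fmax=8000)
    mel_db = librosa.power_to_db(mel, ref=np.max)
    fig = plt.figure(frameon=False)
    librosa.display.specshow(mel_db, sr=SR, fmax=8000)
    plt.axis("off")
    fig.savefig(png_path, bbox_inches="tight", pad_inches=0)
    plt.close(fig)

# Frequency grid consistent with the stated SincConv step: 8000 Hz / 15 filters
# gives the ~533.33 Hz spacing reported in the abstract (filter count inferred).
band_edges = np.linspace(0.0, 8000.0, num=16)  # 15 bands -> 16 edges

# Fine-tune a YOLOv8 classifier on the rendered spectrogram images.
# Assumed folder layout: melspec_data/{train,val}/{real,fake}/*.png
model = YOLO("yolov8n-cls.pt")
model.train(
    data="melspec_data",
    epochs=50,                  # training budget reported in the abstract
    lr0=0.01,                   # initial learning rate reported in the abstract
    warmup_epochs=3.0,          # warm-up length assumed; abstract only says "warm-up"
    optimizer="SGD",
    momentum=0.937,             # momentum reported in the abstract
    fliplr=0.5, flipud=0.5,     # random flips
    translate=0.1,              # random translations
    hsv_h=0.015, hsv_s=0.7, hsv_v=0.4,  # HSV color augmentations
)
```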
Results. The experiments demonstrate varying effectiveness of the mel-spectrogram representations across datasets. Concatenated mel-spectrograms showed superior performance on diverse, real-world datasets (In-the-Wild: 34.55% EER; Fake-or-Real: 35.3% EER), while simple mel-spectrograms performed better on more homogeneous datasets (ASVspoof DF: 28.99% EER; WaveFake: 34.55% EER). Feature map visualizations reveal that the model's attention patterns differ significantly between input types, with concatenated spectrograms showing more distributed focus across relevant regions for complex datasets. The training process, conducted over 50 epochs with a learning rate of 0.01 and a warm-up strategy, demonstrated stable convergence and consistent performance across multiple runs.
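For reference, EER (equal error rate) is the operating point at which the false-acceptance and false-rejection rates are equal; lower values are better. A minimal sketch of how it can be computed from detection scores, assuming higher scores indicate genuine audio, is:

```python
# Minimal EER computation sketch; score orientation (higher = genuine) is assumed.
import numpy as np
from sklearn.metrics import roc_curve

def compute_eer(genuine_scores: np.ndarray, spoof_scores: np.ndarray) -> float:
    """Equal error rate: the point where false-accept and false-reject rates cross."""
    scores = np.concatenate([genuine_scores, spoof_scores])
    labels = np.concatenate([np.ones(len(genuine_scores)), np.zeros(len(spoof_scores))])
    fpr, tpr, _ = roc_curve(labels, scores, pos_label=1)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))  # threshold where the two rates are closest
    return float((fpr[idx] + fnr[idx]) / 2.0)

# Example with synthetic, partially overlapping score distributions:
# compute_eer(np.random.rand(1000) + 0.5, np.random.rand(1000)) -> roughly 0.25
```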
Conclusions. The experimental results confirm the viability of using YOLOv8 for deepfake audio detection and demonstrate that
the effectiveness of mel-spectrogram representations depends significantly on dataset characteristics. The findings suggest that input
representation should be selected based on the specific properties of the target audio data, with concatenated spectrograms being
more suitable for diverse, real-world scenarios and simple spectrograms for more controlled, homogeneous datasets. The study provides
a foundation for future research in adaptive representation selection and model optimization for deepfake audio detection.