Zhang Bowen, Cui Hui, Nguyen Van, Whitty Monica
Department of Software Systems and Cybersecurity, Faculty of IT, Monash University, Melbourne, VIC 3800, Australia.
Sensors (Basel). 2025 Mar 22;25(7):1989. doi: 10.3390/s25071989.
Advancements in audio synthesis and manipulation technologies have reshaped applications such as personalised virtual assistants, voice cloning for creative content, and language learning tools. However, the misuse of these technologies to create audio deepfakes has raised serious concerns about security, privacy, and trust. Studies reveal that human judgement of deepfake audio is not always reliable, highlighting the urgent need for robust detection technologies to mitigate these risks. This paper provides a comprehensive survey of recent advancements in audio deepfake detection, with a focus on cutting-edge developments in the past few years. It begins by exploring the foundational methods of audio deepfake generation, including text-to-speech (TTS) and voice conversion (VC), followed by a review of datasets driving progress in the field. The survey then delves into detection approaches, covering frontend feature extraction, backend classification models, and end-to-end systems. Additionally, emerging topics such as privacy-preserving detection, explainability, and fairness are discussed. Finally, this paper identifies key challenges and outlines future directions for developing robust and scalable audio deepfake detection systems.