Muruganandham Priyanka, Thangasamy Govardhana Rajan, Jayaraman Sangeetha, Dharmarajan Rekha
Department of CSE, Srinivasa Ramanujan Centre, SASTRA Deemed to be University, Kumbakonam, 612 001, India.
Sci Rep. 2025 Jul 2;15(1):23514. doi: 10.1038/s41598-025-08198-6.
With the rapid advancement of synthetic speech technologies, detecting deepfake audio has become essential for preventing impersonation and misinformation. This study aims to enhance detection performance by addressing limitations of existing models, such as temporal inconsistencies, weak contextual representation, and high reconstruction loss. A novel framework, termed Long Short-Term Memory Auto-Encoder with Dynamic Residual Difference Encoding (LSTM-AE-DRDE), is proposed to overcome these challenges. The framework consists of two parallel modules: one leverages an attention-enhanced LSTM with contrastive learning to highlight critical temporal cues, while the other amplifies real-vs-fake separability by computing residual differences across transformed audio variants. By integrating diverse speech features, including MFCC, temporal, prosodic, wavelet packet, and glottal parameters, the model captures both low- and high-level audio characteristics. Experimental evaluation was carried out on five benchmark datasets (CVoice Fake, FoR, Deepfake Voice Recognition, ODSS, and CMFD), on which the proposed model achieved classification accuracies of 97%, 90%, 96%, 97%, and 95%, respectively. Furthermore, when compared with eleven state-of-the-art methods, the proposed model demonstrated superior performance, with an overall ROC-AUC of approximately 98%. In addition, a comprehensive feature-wise ablation study was conducted to assess the contribution of each feature set, confirming the robustness and reliability of the proposed framework.
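The abstract does not include implementation details, but the two-branch design it describes can be illustrated with a short sketch. Below is a minimal PyTorch sketch, assuming frame-level fused features (e.g., a 60-dimensional vector per frame combining MFCC, temporal, prosodic, wavelet packet, and glottal parameters); all module names, layer sizes, and the additive-attention pooling are hypothetical illustrations, not the authors' published architecture.

```python
# Minimal sketch of the dual-branch LSTM-AE-DRDE idea (hypothetical names and sizes).
import torch
import torch.nn as nn

class AttnLSTMAutoEncoder(nn.Module):
    """Branch 1: attention-enhanced LSTM auto-encoder over frame-level features."""
    def __init__(self, feat_dim=60, hidden=128):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.attn = nn.Linear(hidden, 1)              # additive attention score per frame
        self.decoder = nn.LSTM(hidden, feat_dim, batch_first=True)

    def forward(self, x):                             # x: (batch, frames, feat_dim)
        h, _ = self.encoder(x)                        # (batch, frames, hidden)
        w = torch.softmax(self.attn(h), dim=1)        # attention weights over time
        context = w * h                               # emphasize critical temporal cues
        recon, _ = self.decoder(context)              # reconstruction for the AE loss
        embed = context.sum(dim=1)                    # pooled embedding (contrastive target)
        return recon, embed

class ResidualDifferenceEncoder(nn.Module):
    """Branch 2: encode residual differences between original and transformed variants."""
    def __init__(self, feat_dim=60, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)

    def forward(self, x, x_variant):
        residual = x - x_variant                      # dynamic residual difference
        h, _ = self.lstm(residual)
        return h[:, -1, :]                            # final-step residual embedding

class LSTMAEDRDE(nn.Module):
    """Fuse both branch embeddings for real-vs-fake classification."""
    def __init__(self, feat_dim=60, hidden=128):
        super().__init__()
        self.branch1 = AttnLSTMAutoEncoder(feat_dim, hidden)
        self.branch2 = ResidualDifferenceEncoder(feat_dim, hidden)
        self.classifier = nn.Linear(2 * hidden, 2)    # logits: [real, fake]

    def forward(self, x, x_variant):
        recon, e1 = self.branch1(x)
        e2 = self.branch2(x, x_variant)
        logits = self.classifier(torch.cat([e1, e2], dim=-1))
        return logits, recon                          # recon feeds the auto-encoder loss

# Toy usage: 4 clips, 200 frames, 60-dim fused features; the "variant" here is a
# stand-in perturbation, not the paper's actual audio transformations.
model = LSTMAEDRDE()
x = torch.randn(4, 200, 60)
logits, recon = model(x, x + 0.01 * torch.randn_like(x))
```

In such a design the reconstruction and contrastive terms would be combined with the cross-entropy classification loss during training; the abstract does not specify how these objectives are weighted.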