Anhui Province Key Laboratory of Multimodal Cognitive Computation, School of Computer Science and Technology, Anhui University, Hefei, 230601, China.
Neural Netw. 2024 Jul;175:106320. doi: 10.1016/j.neunet.2024.106320. Epub 2024 Apr 16.
The rhythm of bona fide speech is often difficult to replicate, so the fundamental frequency (F0) of synthetic speech differs significantly from that of real speech. The F0 feature is therefore expected to carry discriminative information for the fake speech detection (FSD) task. In this paper, we propose a novel F0 subband for FSD. To model the F0 subband effectively and thereby improve FSD performance, we also propose the spatial reconstructed local attention Res2Net (SR-LA Res2Net). Specifically, Res2Net serves as the backbone network to obtain multiscale information and is enhanced with a spatial reconstruction mechanism to avoid losing important information as the channel groups are repeatedly superimposed. In addition, local attention is designed to make the model focus on the local information of the F0 subband. Experimental results on the ASVspoof 2019 LA dataset show that the proposed method obtains an equal error rate (EER) of 0.47% and a minimum tandem detection cost function (min t-DCF) of 0.0159, achieving state-of-the-art performance among single systems.
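The equal error rate quoted in the abstract is the operating point at which the false acceptance rate (spoofed speech accepted as bona fide) equals the false rejection rate (bona fide speech rejected). As a rough illustration of how such a figure could be computed from detector scores, the following is a minimal sketch and not the authors' evaluation code. The function name `equal_error_rate`, the convention that higher scores mean "more likely bona fide", and the simple threshold sweep without interpolation are all assumptions made for this example.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate the EER by sweeping a decision threshold.

    Assumption: a higher score means "more likely bona fide", and a trial
    is accepted when its score is >= the threshold. This hypothetical helper
    picks the threshold where the two error rates are closest, without
    interpolating between thresholds.
    """
    if not bonafide_scores or not spoof_scores:
        raise ValueError("both score lists must be non-empty")

    # Candidate thresholds: every observed score, plus one above the maximum
    # so that "reject everything" is also considered.
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    thresholds.append(thresholds[-1] + 1.0)

    best = None  # (gap, eer, threshold)
    for t in thresholds:
        # False acceptance rate: spoofed trials scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        # False rejection rate: bona fide trials scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0, t)

    return best[1], best[2]


if __name__ == "__main__":
    # Toy scores, invented purely for illustration.
    eer, threshold = equal_error_rate([0.9, 0.8, 0.4, 0.6], [0.1, 0.2, 0.5, 0.3])
    print(f"EER = {eer:.2%} at threshold {threshold}")
```

An EER of 0.47% means that, at the balancing threshold, roughly 0.47% of spoofed trials are accepted and roughly 0.47% of bona fide trials are rejected. Official ASVspoof evaluations use their own scoring tools, and min t-DCF additionally depends on the cost model and the paired speaker verification system, which this sketch does not cover.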