National Laboratory of Radar Signal Processing, Xidian University, Xi’an 710071, China.
Sensors (Basel). 2022 Mar 10;22(6):2147. doi: 10.3390/s22062147.
Convolutional neural networks (CNNs) have significantly advanced speaker verification (SV) systems because of their powerful deep feature learning capability. In CNN-based SV systems, utterance-level aggregation is an important component: it compresses the frame-level features generated by the CNN frontend into an utterance-level representation. However, most existing aggregation methods pool the extracted features only across time and cannot capture the speaker-dependent information contained in the frequency domain. To address this problem, this paper proposes a novel attention-based frequency aggregation method, which focuses on the key frequency bands that contribute most to the utterance-level representation. In addition, two more effective temporal-frequency aggregation methods are proposed by combining the frequency aggregation with existing temporal aggregation methods. The two proposed methods capture the speaker-dependent information contained in both the time and frequency domains of the frame-level features, thus improving the discriminability of the speaker embedding. Furthermore, a powerful CNN-based SV system is developed and evaluated on the TIMIT and VoxCeleb datasets. The experimental results indicate that the CNN-based SV system using the temporal-frequency aggregation method achieves a superior equal error rate of 5.96% on VoxCeleb compared with state-of-the-art baseline models.
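The aggregation pipeline described above can be illustrated with a minimal sketch. This is not the paper's exact formulation; the attention parameterization (a single learned vector `w` scoring each frequency band, shared across time) and the mean-plus-standard-deviation temporal pooling are assumptions chosen for illustration, shown here in plain NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def freq_attention_pool(feats, w, b):
    """Attention-based frequency aggregation (illustrative).

    feats: (C, F, T) frame-level feature map from the CNN frontend
    w, b:  hypothetical learned scoring vector (C,) and scalar bias
    Returns a (C, T) map with the F frequency bands collapsed by
    attention weights, so informative bands contribute more.
    """
    scores = np.einsum('cft,c->ft', feats, w) + b   # one score per band, per frame
    alpha = softmax(scores, axis=0)                  # weights over the F bands sum to 1
    return np.einsum('cft,ft->ct', feats, alpha)     # weighted sum over frequency

def temporal_stats_pool(feats_ct):
    """Temporal aggregation: mean + std over time -> (2C,) utterance embedding."""
    mu = feats_ct.mean(axis=1)
    sigma = feats_ct.std(axis=1)
    return np.concatenate([mu, sigma])

# Toy frame-level features: 64 channels, 8 frequency bands, 100 frames.
C, F, T = 64, 8, 100
x = rng.standard_normal((C, F, T))
w, b = rng.standard_normal(C), 0.0

# Temporal-frequency aggregation: frequency attention, then temporal pooling.
emb = temporal_stats_pool(freq_attention_pool(x, w, b))
print(emb.shape)  # (128,)
```

Composing the two stages in this order (frequency first, then time) is one of the two combinations the abstract alludes to; in a trained system `w` and `b` would be learned jointly with the CNN frontend rather than drawn at random.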