College of Computer Science and Cyber Security (Oxford Brookes College), Chengdu University of Technology, Chengdu 610059, China.
Artificial Intelligence Research Center, Chengdu University of Technology, Chengdu 610059, China.
Sensors (Basel). 2023 Jan 20;23(3):1203. doi: 10.3390/s23031203.
In speaker recognition tasks, convolutional neural network (CNN)-based approaches have shown significant success. Modeling long-term contexts and efficiently aggregating information are two challenges in speaker recognition, and they have a critical impact on system performance. Previous research has addressed these issues by introducing deeper, wider, and more complex network architectures and aggregation methods. However, it is difficult to significantly improve performance with these approaches because they struggle to fully utilize global information, channel information, and time-frequency information. To address the above issues, we propose a lighter and more efficient CNN-based end-to-end speaker recognition architecture, ResSKNet-SSDP. ResSKNet-SSDP consists of a residual selective kernel network (ResSKNet) and self-attentive standard deviation pooling (SSDP). ResSKNet can capture long-term contexts, neighboring information, and global information, thus extracting more informative frame-level features. SSDP can capture short- and long-term changes in frame-level features, aggregating the variable-length frame-level features into fixed-length, more distinctive utterance-level features. We performed extensive comparison experiments against current state-of-the-art speaker recognition systems on two popular public speaker recognition datasets, VoxCeleb and CN-Celeb, and achieved the lowest EER/DCF values of 2.33%/0.2298, 2.44%/0.2559, 4.10%/0.3502, and 12.28%/0.5051. Compared with the lightest x-vector, our designed ResSKNet-SSDP has 3.1 M fewer parameters and 31.6 ms less inference time, yet 35.1% better performance. The results show that ResSKNet-SSDP significantly outperforms current state-of-the-art speaker recognition architectures on all test sets and is an end-to-end architecture with fewer parameters and higher efficiency, suitable for real-world applications.
The ablation experiments further show that our proposed approaches also provide significant improvements over previous methods.
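To make the aggregation step concrete: the SSDP layer described above maps a variable-length sequence of frame-level features to a fixed-length utterance-level embedding by computing an attention-weighted standard deviation over time. The sketch below illustrates this attentive statistics-style pooling in plain NumPy; the attention parameterization (a single `tanh` projection with parameters `W` and `v`) and the mean-plus-std concatenation are illustrative assumptions, not the paper's exact SSDP formulation.

```python
import numpy as np

def self_attentive_std_pooling(frames, W, v):
    """Illustrative sketch of self-attentive standard deviation pooling.

    frames: (T, D) variable-length frame-level features.
    W: (D, H), v: (H,) -- hypothetical attention parameters.
    Returns a fixed-length (2*D,) utterance-level embedding,
    independent of the number of frames T.
    """
    # Scalar attention score per frame (assumed tanh scoring function).
    scores = np.tanh(frames @ W) @ v            # (T,)
    # Numerically stable softmax over time.
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()                  # (T,) attention weights
    # Attention-weighted mean and standard deviation over frames.
    mu = (alpha[:, None] * frames).sum(axis=0)   # (D,)
    var = (alpha[:, None] * (frames - mu) ** 2).sum(axis=0)
    sigma = np.sqrt(np.clip(var, 1e-8, None))    # (D,)
    # Concatenate first- and second-order statistics.
    return np.concatenate([mu, sigma])           # (2*D,)
```

Whatever the input length T, the output dimension is fixed at 2*D, which is what lets variable-length utterances feed a fixed-size classifier or scoring back end.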