College of Computer Science and Cyber Security (Oxford Brookes College), Chengdu University of Technology, Chengdu 610059, China.
Artificial Intelligence Research Center, Chengdu University of Technology, Chengdu 610059, China.
Sensors (Basel). 2023 Jan 20;23(3):1203. doi: 10.3390/s23031203.
In speaker recognition tasks, convolutional neural network (CNN)-based approaches have shown significant success. Modeling long-term contexts and efficiently aggregating information are two challenges in speaker recognition, and they have a critical impact on system performance. Previous research has addressed these issues by introducing deeper, wider, and more complex network architectures and aggregation methods. However, it is difficult to significantly improve performance with these approaches because they struggle to fully utilize global information, channel information, and time-frequency information. To address the above issues, we propose a lighter and more efficient CNN-based end-to-end speaker recognition architecture, ResSKNet-SSDP. ResSKNet-SSDP consists of a residual selective kernel network (ResSKNet) and self-attentive standard deviation pooling (SSDP). ResSKNet can capture long-term contexts, neighboring information, and global information, thus extracting more informative frame-level features. SSDP can capture short- and long-term changes in frame-level features, aggregating the variable-length frame-level features into fixed-length, more distinctive utterance-level features. We performed extensive comparison experiments against current state-of-the-art speaker recognition systems on two popular public speaker recognition datasets, VoxCeleb and CN-Celeb, and achieved the lowest EER/DCF values of 2.33%/0.2298, 2.44%/0.2559, 4.10%/0.3502, and 12.28%/0.5051. Compared with the lightest x-vector, our designed ResSKNet-SSDP has 3.1 M fewer parameters and 31.6 ms less inference time, yet 35.1% better performance. The results show that ResSKNet-SSDP significantly outperforms current state-of-the-art speaker recognition architectures on all test sets and is an end-to-end architecture with fewer parameters and higher efficiency, suitable for real-world applications.
The ablation experiments further show that our proposed approaches also provide significant improvements over previous methods.
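To make the aggregation step concrete: the SSDP layer described above maps a variable-length sequence of frame-level features to a fixed-length utterance-level embedding by computing an attention-weighted standard deviation over time. The sketch below illustrates this attentive statistics-style pooling in plain NumPy; the attention parameterization (a single `tanh` projection with parameters `W` and `v`) and the mean-plus-std concatenation are illustrative assumptions, not the paper's exact SSDP formulation.

```python
import numpy as np

def self_attentive_std_pooling(frames, W, v):
    """Illustrative sketch of self-attentive standard deviation pooling.

    frames: (T, D) variable-length frame-level features.
    W: (D, H), v: (H,) -- hypothetical attention parameters.
    Returns a fixed-length (2*D,) utterance-level embedding,
    independent of the number of frames T.
    """
    # Scalar attention score per frame (assumed tanh scoring function).
    scores = np.tanh(frames @ W) @ v            # (T,)
    # Numerically stable softmax over time.
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()                  # (T,) attention weights
    # Attention-weighted mean and standard deviation over frames.
    mu = (alpha[:, None] * frames).sum(axis=0)   # (D,)
    var = (alpha[:, None] * (frames - mu) ** 2).sum(axis=0)
    sigma = np.sqrt(np.clip(var, 1e-8, None))    # (D,)
    # Concatenate first- and second-order statistics.
    return np.concatenate([mu, sigma])           # (2*D,)
```

Whatever the input length T, the output dimension is fixed at 2*D, which is what lets variable-length utterances feed a fixed-size classifier or scoring back end.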