Speaker recognition based on deep learning: An overview.

Affiliation

Center of Intelligent Acoustics and Immersive Communications (CIAIC) and the School of Marine Science and Technology, Northwestern Polytechnical University, Xi'an Shaanxi 710072, China.

Publication information

Neural Netw. 2021 Aug;140:65-99. doi: 10.1016/j.neunet.2021.03.004. Epub 2021 Mar 17.

DOI:10.1016/j.neunet.2021.03.004
PMID:33744714
Abstract

Speaker recognition is the task of identifying persons from their voices. Recently, deep learning has dramatically revolutionized speaker recognition. However, there is a lack of comprehensive reviews of this exciting progress. In this paper, we review several major subtasks of speaker recognition, including speaker verification, identification, diarization, and robust speaker recognition, with a focus on deep-learning-based methods. Because the major advantage of deep learning over conventional methods is its representation ability, which allows it to produce highly abstract embedding features from utterances, we first pay close attention to deep-learning-based speaker feature extraction, covering the inputs, network structures, temporal pooling strategies, and objective functions, which are the fundamental components of many speaker recognition subtasks. Then, we give an overview of speaker diarization, with an emphasis on recent supervised, end-to-end, and online diarization. Finally, we survey robust speaker recognition from the perspectives of domain adaptation and speech enhancement, which are two major approaches to dealing with domain mismatch and noise problems. Popular and recently released corpora are listed at the end of the paper.
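
The abstract names temporal pooling and utterance-level embeddings without giving details. As a rough illustration only, the sketch below shows one common form of temporal pooling (per-dimension mean and standard deviation over frames) followed by cosine scoring for verification. The function names, the 0.7 threshold, and the use of raw pooled statistics as the "embedding" are assumptions made for this example and are not taken from the paper; a real system would pass frame features through a trained network before and after pooling.

```python
import math

def statistics_pooling(frames):
    """Collapse a variable-length list of frame vectors into one
    fixed-length vector: per-dimension mean followed by per-dimension
    (population) standard deviation."""
    n = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n)
        for d in range(dim)
    ]
    return means + stds

def cosine_score(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def verify(enroll_frames, test_frames, threshold=0.7):
    """Hypothetical verification decision: pool both utterances,
    score them, and accept if the score reaches the (assumed) threshold."""
    score = cosine_score(
        statistics_pooling(enroll_frames),
        statistics_pooling(test_frames),
    )
    return score >= threshold, score
```

Because pooling removes the time axis, utterances of different lengths map to vectors of the same size (twice the frame dimension here), which is what makes a single similarity score between two recordings possible.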


Similar articles

1
Speaker recognition based on deep learning: An overview.
Neural Netw. 2021 Aug;140:65-99. doi: 10.1016/j.neunet.2021.03.004. Epub 2021 Mar 17.
2
Learning speaker-specific characteristics with a deep neural architecture.
IEEE Trans Neural Netw. 2011 Nov;22(11):1744-56. doi: 10.1109/TNN.2011.2167240. Epub 2011 Sep 26.
3
Contrastive Speaker Representation Learning with Hard Negative Sampling for Speaker Recognition.
Sensors (Basel). 2024 Sep 25;24(19):6213. doi: 10.3390/s24196213.
4
Development of Supervised Speaker Diarization System Based on the PyAnnote Audio Processing Library.
Sensors (Basel). 2023 Feb 13;23(4):2082. doi: 10.3390/s23042082.
5
Speaker-turn aware diarization for speech-based cognitive assessments.
Front Neurosci. 2024 Jan 16;17:1351848. doi: 10.3389/fnins.2023.1351848. eCollection 2023.
6
The Impact of Speaker Diarization on DNN-based Autism Severity Estimation.
Annu Int Conf IEEE Eng Med Biol Soc. 2022 Jul;2022:3414-3417. doi: 10.1109/EMBC48229.2022.9871523.
7
Self-Supervised Open-Set Speaker Recognition with Laguerre-Voronoi Descriptors.
Sensors (Basel). 2024 Mar 21;24(6):1996. doi: 10.3390/s24061996.
8
Real-time multilingual speech recognition and speaker diarization system based on Whisper segmentation.
PeerJ Comput Sci. 2024 Mar 29;10:e1973. doi: 10.7717/peerj-cs.1973. eCollection 2024.
9
Machine learning based sample extraction for automatic speech recognition using dialectal Assamese speech.
Neural Netw. 2016 Jun;78:97-111. doi: 10.1016/j.neunet.2015.12.010. Epub 2015 Dec 30.
10
Phonetic variability constrained bottleneck features for joint speaker recognition and physical task stress detection.
J Acoust Soc Am. 2020 Nov;148(5):2912. doi: 10.1121/10.0002455.

Cited by

1
A systematic study of DNN based speech enhancement in reverberant and reverberant-noisy environments.
Comput Speech Lang. 2025 Jan;89. doi: 10.1016/j.csl.2024.101677. Epub 2024 Jun 6.
2
Research on multi-branch residual connection spectrum image classification based on attention mechanism.
Sci Rep. 2025 Jul 15;15(1):25532. doi: 10.1038/s41598-025-11283-5.
3
Target sample mining with modified activation residual network for speaker verification.
PLoS One. 2025 Apr 16;20(4):e0320256. doi: 10.1371/journal.pone.0320256. eCollection 2025.
4
Gene print-based cell subtypes annotation of human disease across heterogeneous datasets with gPRINT.
Protein Cell. 2025 Aug 7;16(8):685-704. doi: 10.1093/procel/pwaf001.
5
Leveraging laryngograph data for robust voicing detection in speech.
J Acoust Soc Am. 2024 Nov 1;156(5):3502-3513. doi: 10.1121/10.0034445.
6
Contrastive Speaker Representation Learning with Hard Negative Sampling for Speaker Recognition.
Sensors (Basel). 2024 Sep 25;24(19):6213. doi: 10.3390/s24196213.
7
A machine learning approach for vocal fold segmentation and disorder classification based on ensemble method.
Sci Rep. 2024 Jun 23;14(1):14435. doi: 10.1038/s41598-024-64987-5.
8
Cortical-striatal brain network distinguishes deepfake from real speaker identity.
Commun Biol. 2024 Jun 11;7(1):711. doi: 10.1038/s42003-024-06372-6.
9
Recurrence plot embeddings as short segment nonlinear features for multimodal speaker identification using air, bone and throat microphones.
Sci Rep. 2024 May 31;14(1):12513. doi: 10.1038/s41598-024-62406-3.
10
An Audio-Based SLAM for Indoor Environments: A Robotic Mixed Reality Presentation.
Sensors (Basel). 2024 Apr 27;24(9):2796. doi: 10.3390/s24092796.