Combination of deep speaker embeddings for diarisation.

Authors

Sun Guangzhi, Zhang Chao, Woodland Philip C

Affiliation

Cambridge University Engineering Department, Trumpington Street, Cambridge, CB2 1PZ, UK.

Publication

Neural Netw. 2021 Sep;141:372-384. doi: 10.1016/j.neunet.2021.04.020. Epub 2021 Apr 21.

DOI: 10.1016/j.neunet.2021.04.020
PMID: 33984663

Abstract

Significant progress has recently been made in speaker diarisation after the introduction of d-vectors as speaker embeddings extracted from neural network (NN) speaker classifiers for clustering speech segments. To extract better-performing and more robust speaker embeddings, this paper proposes a c-vector method by combining multiple sets of complementary d-vectors derived from systems with different NN components. Three structures are used to implement the c-vectors, namely 2D self-attentive, gated additive, and bilinear pooling structures, relying on attention mechanisms, a gating mechanism, and a low-rank bilinear pooling mechanism respectively. Furthermore, a neural-based single-pass speaker diarisation pipeline is also proposed in this paper, which uses NNs to achieve voice activity detection, speaker change point detection, and speaker embedding extraction. Experiments and detailed analyses are conducted on the challenging AMI and NIST RT05 datasets, which consist of real meetings with 4-10 speakers and a wide range of acoustic conditions. For systems trained on the AMI training set, relative speaker error rate (SER) reductions of 13% and 29% are obtained by using c-vectors instead of d-vectors on the AMI dev and eval sets respectively, and a relative SER reduction of 15% is observed on RT05, which shows the robustness of the proposed methods. By incorporating VoxCeleb data into the training set, the best c-vector system achieved 7%, 17% and 16% relative SER reductions compared to the d-vector system on the AMI dev, eval and RT05 sets respectively.
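
The improvements in the abstract are quoted as relative SER reductions, which are easy to confuse with absolute percentage-point differences. The following minimal Python sketch shows the usual definition, (baseline - new) / baseline. The function name `relative_reduction` and the SER values are hypothetical and chosen only for illustration; they are not taken from the paper, which does not report absolute SER figures in this abstract.

```python
def relative_reduction(baseline_ser: float, new_ser: float) -> float:
    """Return the relative reduction of new_ser with respect to baseline_ser.

    The result is a fraction, e.g. 0.15 for a 15% relative reduction.
    """
    if baseline_ser <= 0:
        raise ValueError("baseline_ser must be positive")
    return (baseline_ser - new_ser) / baseline_ser


# Hypothetical SER values in percent (NOT from the paper).
d_vector_ser = 20.0  # assumed baseline d-vector system
c_vector_ser = 17.0  # assumed combined c-vector system

rel = relative_reduction(d_vector_ser, c_vector_ser)
print(f"absolute reduction: {d_vector_ser - c_vector_ser:.1f} points")
print(f"relative reduction: {rel:.0%}")
```

Under these assumed values, a drop of 3 absolute points corresponds to a 15% relative reduction, so the same relative figure can reflect quite different absolute gains depending on the baseline.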

Similar articles

1. Combination of deep speaker embeddings for diarisation.
   Neural Netw. 2021 Sep;141:372-384. doi: 10.1016/j.neunet.2021.04.020. Epub 2021 Apr 21.
2. Ensemble learning with speaker embeddings in multiple speech task stimuli for depression detection.
   Front Neurosci. 2023 Mar 23;17:1141621. doi: 10.3389/fnins.2023.1141621. eCollection 2023.
3. H-VECTORS: Improving the robustness in utterance-level speaker embeddings using a hierarchical attention model.
   Neural Netw. 2021 Oct;142:329-339. doi: 10.1016/j.neunet.2021.05.024. Epub 2021 May 25.
4. Meta-learning with Latent Space Clustering in Generative Adversarial Network for Speaker Diarization.
   IEEE/ACM Trans Audio Speech Lang Process. 2021;29:1204-1219. doi: 10.1109/taslp.2021.3061885. Epub 2021 Feb 26.
5. Semi-supervised audio-driven TV-news speaker diarization using deep neural embeddings.
   J Acoust Soc Am. 2020 Dec;148(6):3751. doi: 10.1121/10.0002924.
6. Phonetic variability constrained bottleneck features for joint speaker recognition and physical task stress detection.
   J Acoust Soc Am. 2020 Nov;148(5):2912. doi: 10.1121/10.0002455.
7. Manifestation of depression in speech overlaps with characteristics used to represent and recognize speaker identity.
   Sci Rep. 2023 Jul 10;13(1):11155. doi: 10.1038/s41598-023-35184-7.
8. Impact of Feature Selection Algorithm on Speech Emotion Recognition Using Deep Convolutional Neural Network.
   Sensors (Basel). 2020 Oct 23;20(21):6008. doi: 10.3390/s20216008.
9. Effects of language mismatch in automatic forensic voice comparison using deep learning embeddings.
   J Forensic Sci. 2023 May;68(3):871-883. doi: 10.1111/1556-4029.15250. Epub 2023 Mar 31.
10. Recurrence plot embeddings as short segment nonlinear features for multimodal speaker identification using air, bone and throat microphones.
    Sci Rep. 2024 May 31;14(1):12513. doi: 10.1038/s41598-024-62406-3.