Combination of deep speaker embeddings for diarisation.

Authors

Sun Guangzhi, Zhang Chao, Woodland Philip C

Affiliation

Cambridge University Engineering Department, Trumpington Street, Cambridge, CB2 1PZ, UK.

Publication

Neural Netw. 2021 Sep;141:372-384. doi: 10.1016/j.neunet.2021.04.020. Epub 2021 Apr 21.

DOI:10.1016/j.neunet.2021.04.020

PMID:33984663

Abstract

Significant progress has recently been made in speaker diarisation after the introduction of d-vectors as speaker embeddings extracted from neural network (NN) speaker classifiers for clustering speech segments. To extract better-performing and more robust speaker embeddings, this paper proposes a c-vector method that combines multiple sets of complementary d-vectors derived from systems with different NN components. Three structures are used to implement the c-vectors, namely 2D self-attentive, gated additive, and bilinear pooling structures, relying on attention mechanisms, a gating mechanism, and a low-rank bilinear pooling mechanism respectively. Furthermore, a neural-based single-pass speaker diarisation pipeline is proposed, which uses NNs to achieve voice activity detection, speaker change point detection, and speaker embedding extraction. Experiments and detailed analyses are conducted on the challenging AMI and NIST RT05 datasets, which consist of real meetings with 4-10 speakers and a wide range of acoustic conditions. For systems trained on the AMI training set, relative speaker error rate (SER) reductions of 13% and 29% are obtained by using c-vectors instead of d-vectors on the AMI dev and eval sets respectively, and a relative SER reduction of 15% is observed on RT05, which shows the robustness of the proposed methods. By incorporating VoxCeleb data into the training set, the best c-vector system achieved 7%, 17% and 16% relative SER reductions compared to the d-vector on the AMI dev, eval and RT05 sets respectively.
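As a rough illustration of two ideas in the abstract, the sketch below shows (a) one possible reading of a gated additive combination, in which each d-vector is scaled by a learned scalar gate before summation, and (b) how a relative SER reduction is computed. The abstract does not give the paper's exact formulation, so the gate form `sigmoid(w . d + b)`, the function names, and all parameter values are assumptions made for illustration only.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real number to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_additive_combine(d_vectors, gate_weights, gate_biases):
    """Combine same-dimension d-vectors into one vector as a gated sum.

    Assumption (not taken from the paper): the gate for system i is the
    scalar sigmoid(w_i . d_i + b_i), and the combined vector is
    sum_i gate_i * d_i.
    """
    dim = len(d_vectors[0])
    if any(len(d) != dim for d in d_vectors):
        raise ValueError("all d-vectors must have the same dimension")
    combined = [0.0] * dim
    for d, w, b in zip(d_vectors, gate_weights, gate_biases):
        gate = sigmoid(sum(wi * di for wi, di in zip(w, d)) + b)
        for k in range(dim):
            combined[k] += gate * d[k]
    return combined


def relative_reduction(baseline_ser, new_ser):
    """Relative SER reduction: (baseline - new) / baseline."""
    return (baseline_ser - new_ser) / baseline_ser


if __name__ == "__main__":
    # Hypothetical 2-dimensional d-vectors from two systems.
    d1, d2 = [1.0, 2.0], [3.0, 4.0]
    # Zero gate parameters give every gate the value 0.5.
    zeros = [[0.0, 0.0], [0.0, 0.0]]
    print(gated_additive_combine([d1, d2], zeros, [0.0, 0.0]))
    # Hypothetical SER values, not results from the paper.
    print(relative_reduction(20.0, 17.4))
```

In a trained system the gate parameters would be learned jointly with the speaker classifier; here they are fixed by hand only to keep the example self-contained.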

