Hao Fengyuan, Li Xiaodong, Zheng Chengshi
Key Laboratory of Noise and Vibration Research, Institute of Acoustics, Chinese Academy of Sciences, Beijing, 100190, China; University of Chinese Academy of Sciences, Beijing, 100049, China.
Neural Netw. 2023 Sep;166:566-578. doi: 10.1016/j.neunet.2023.07.043. Epub 2023 Aug 1.
End-to-end neural diarization (EEND), which can directly output speaker diarization results and handle overlapping speech, has attracted increasing attention due to its promising performance. Although existing EEND-based methods often outperform clustering-based methods, they generalize poorly to unseen test sets because fixed attractors are typically used to estimate the speech activity of each speaker. An iterative adaptive attractor estimation (IAAE) network was proposed to refine diarization results, in which self-attentive EEND (SA-EEND) was used to initialize the diarization results and frame-wise embeddings. The proposed IAAE network has two main parts: an attention-based pooling designed to obtain a rough estimate of the attractors from the diarization results of the previous iteration, and an adaptive attractor then computed by transformer decoder blocks. A unified training framework was proposed to further improve diarization performance, making the embeddings more discriminative with respect to the well-separated attractors. We evaluated the proposed method on both simulated mixtures and the real CALLHOME dataset using the diarization error rate (DER). At the second iteration step, the proposed method reduces DER relative to the baseline SA-EEND by up to 44.8% on simulated 2-speaker mixtures and 23.6% on the CALLHOME dataset. We also show that with more refinement steps, the DER on the CALLHOME dataset can be further reduced to 7.36%, achieving state-of-the-art diarization results compared with other methods.
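The refinement loop described in the abstract can be sketched as follows. This is a minimal PyTorch illustration of one IAAE-style iteration, not the authors' implementation: the class name, layer sizes, and the specific pooling (posterior-weighted averaging of frame embeddings, followed by a transformer decoder with the rough attractors as queries and the embeddings as memory) are assumptions based only on the abstract's description.

```python
import torch
import torch.nn as nn

class IAAESketch(nn.Module):
    """Hedged sketch of one IAAE refinement iteration (hypothetical names/sizes)."""

    def __init__(self, d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        layer = nn.TransformerDecoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.decoder = nn.TransformerDecoder(layer, n_layers)

    def forward(self, emb, post):
        # emb:  (B, T, D) frame-wise embeddings, e.g. from SA-EEND
        # post: (B, T, S) per-speaker diarization posteriors from the previous iteration
        # 1) Attention-based pooling: posterior-weighted average of frame
        #    embeddings gives a rough attractor per speaker.
        w = post / post.sum(dim=1, keepdim=True).clamp(min=1e-8)  # (B, T, S)
        rough = torch.einsum('bts,btd->bsd', w, emb)              # (B, S, D)
        # 2) Adaptive attractors: rough attractors act as decoder queries
        #    attending over the frame embeddings.
        attractors = self.decoder(rough, emb)                     # (B, S, D)
        # 3) New posteriors: frame/attractor inner products through a sigmoid,
        #    to be fed back in as `post` for the next iteration.
        return torch.sigmoid(torch.einsum('btd,bsd->bts', emb, attractors))
```

In use, the module would be applied repeatedly, feeding each iteration's output posteriors back in, which matches the abstract's observation that additional refinement steps further reduce DER.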