Multisensory Fusion for Unsupervised Spatiotemporal Speaker Diarization

Author information

Xylogiannis Paris, Vryzas Nikolaos, Vrysis Lazaros, Dimoulas Charalampos

Affiliation

Multidisciplinary Media & Mediated Communication Research Group (M3C), Aristotle University, 54636 Thessaloniki, Greece.

Publication information

Sensors (Basel). 2024 Jun 29;24(13):4229. doi: 10.3390/s24134229.

DOI:10.3390/s24134229

PMID:39001008

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11243859/

Abstract

Speaker diarization answers the question of "who spoke when" in audio recordings. In meeting scenarios, the task of labeling audio with the corresponding speaker identities can be further assisted by exploiting spatial features. This work proposes a framework designed to assess the effectiveness of combining speaker embeddings with Time Difference of Arrival (TDOA) values from the microphone sensor arrays available in meetings. We extract speaker embeddings using two popular and robust pre-trained models, ECAPA-TDNN and X-vectors, and calculate the TDOA values via the Generalized Cross-Correlation (GCC) method with Phase Transform (PHAT) weighting. Although ECAPA-TDNN outperforms the X-vectors model, we use both speaker embedding models to explore the potential of employing a computationally lighter model when spatial information is exploited. Various techniques for combining the spatiotemporal information are examined in order to determine the best clustering method. The proposed framework is evaluated on two multichannel datasets: the AVLab Speaker Localization dataset and a multichannel dataset (SpeaD-M3C) enriched in the context of the present work with supplementary information from smartphone recordings. Our results strongly indicate that the integration of spatial information can significantly improve the performance of state-of-the-art deep learning diarization models, with a 2-3% reduction in diarization error rate (DER) compared to the baseline approach on the evaluated datasets.
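The TDOA estimation step named in the abstract (GCC with PHAT weighting) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `gcc_phat`, its parameters (`fs`, `max_tau`), and the synthetic test signal are assumptions made for illustration only. The sketch whitens the cross-power spectrum of two microphone channels so that only phase information remains, then takes the lag of the correlation peak as the delay.

```python
import numpy as np


def gcc_phat(sig, ref, fs, max_tau=None):
    """Estimate the delay (in seconds) of `sig` relative to `ref` with GCC-PHAT.

    A positive result means `sig` lags `ref`.
    All names here are illustrative and not taken from the paper.
    """
    # Zero-pad to the combined length so the circular correlation
    # computed via the FFT behaves like a linear correlation.
    n = len(sig) + len(ref)

    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)

    # Cross-power spectrum; PHAT weighting divides out the magnitude
    # so that every frequency bin contributes phase information only.
    R = SIG * np.conj(REF)
    R /= np.maximum(np.abs(R), 1e-12)  # guard against division by zero

    cc = np.fft.irfft(R, n=n)

    # Optionally restrict the search to physically plausible lags,
    # e.g. max_tau = microphone spacing / speed of sound.
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(max(int(fs * max_tau), 1), max_shift)

    # Reorder so that index `max_shift` corresponds to lag 0.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = int(np.argmax(np.abs(cc))) - max_shift
    return shift / float(fs)


if __name__ == "__main__":
    # Synthetic check: white noise delayed by a known number of samples.
    fs = 16000
    rng = np.random.default_rng(0)
    ref = rng.standard_normal(1024)
    delay = 5
    sig = np.concatenate((np.zeros(delay), ref[:-delay]))
    tau = gcc_phat(sig, ref, fs)
    print(round(tau * fs))  # estimated delay in samples
```

In a diarization setting, one such TDOA value would typically be computed per microphone pair and per analysis window. How those values are combined with the speaker embeddings is the subject of the paper itself and is not reproduced here.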
