Cong Han, Nima Mesgarani
Department of Electrical Engineering, Columbia University, New York, NY.
Proc IEEE Int Conf Acoust Speech Signal Process. 2023 Jun;2023. doi: 10.1109/icassp49357.2023.10095695. Epub 2023 May 5.
Binaural speech separation in real-world scenarios often involves moving speakers. Most current speech separation methods use utterance-level permutation invariant training (u-PIT). At inference time, however, the order of the outputs can be inconsistent over time, particularly in long-form speech separation. This situation, referred to as the speaker swap problem, is even more problematic when speakers constantly move in space, and it poses a challenge for placing speakers consistently in the output channels. Here, we describe a real-time binaural speech separation model based on a Wavesplit network that mitigates the speaker swap problem for moving-speaker separation. Our model computes a speaker embedding for each speaker at each time frame from the mixed audio, aggregates the embeddings using online clustering, and uses the cluster centroids as speaker profiles to track each speaker over long durations. Experimental results on reverberant, long-form, moving multitalker speech separation show that the proposed method is less prone to speaker swap and achieves performance comparable to u-PIT-based models with ground-truth tracking, in both separation accuracy and preservation of the interaural cues.
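The tracking idea in the abstract (per-frame speaker embeddings, online clustering, centroids as speaker profiles) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `OnlineSpeakerTracker` class, the greedy matching, and the running-mean update rate `lr` are all assumptions made for the example.

```python
import numpy as np

class OnlineSpeakerTracker:
    """Hypothetical sketch: cluster per-frame speaker embeddings online and
    use the cluster centroids as speaker profiles, so each speaker is mapped
    to a fixed output channel even as the separator's output order varies."""

    def __init__(self, num_speakers, lr=0.05):
        self.num_speakers = num_speakers
        self.lr = lr            # running-mean update rate for the centroids
        self.centroids = None   # (num_speakers, dim) speaker profiles

    def assign(self, frame_embeddings):
        """frame_embeddings: (num_speakers, dim) array of embeddings for one
        frame, in arbitrary order. Returns perm, where perm[i] is the output
        channel assigned to embedding i, and updates the matched centroids."""
        if self.centroids is None:
            # First frame: initialize the profiles from the embeddings.
            self.centroids = frame_embeddings.astype(float).copy()
            return np.arange(self.num_speakers)

        # Cosine similarity between each embedding and each profile.
        e = frame_embeddings / np.linalg.norm(frame_embeddings, axis=1, keepdims=True)
        c = self.centroids / np.linalg.norm(self.centroids, axis=1, keepdims=True)
        sim = e @ c.T           # (num_speakers, num_speakers)

        # Greedy best-match assignment (optimal matching, e.g. the Hungarian
        # algorithm, could be used instead for more than a few speakers).
        perm = np.full(self.num_speakers, -1)
        used = set()
        for i in np.argsort(-sim.max(axis=1)):
            for j in np.argsort(-sim[i]):
                if j not in used:
                    perm[i] = j
                    used.add(j)
                    break

        # Running-mean update of each matched profile.
        for i, j in enumerate(perm):
            self.centroids[j] = (1 - self.lr) * self.centroids[j] \
                                + self.lr * frame_embeddings[i]
        return perm
```

With this scheme, even if the separator emits the two speakers in swapped order on some frame, matching against the accumulated profiles restores a consistent channel assignment.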