Department of Computer Science and Engineering, The Ohio State University, Columbus, Ohio 43210, USA.
J Acoust Soc Am. 2020 Sep;148(3):1157. doi: 10.1121/10.0001779.
Speaker separation is a special case of speech separation in which the mixture signal comprises two or more speakers. Many talker-independent speaker separation methods have been introduced in recent years to address this problem in anechoic conditions. To consider more realistic environments, this paper investigates talker-independent speaker separation in reverberant conditions. To deal effectively with both speaker separation and speech dereverberation, the paper proposes extending the deep computational auditory scene analysis (CASA) approach to a two-stage system: reverberant utterances are first separated, and the separated utterances are then dereverberated. The proposed two-stage deep CASA system significantly outperforms a baseline one-stage deep CASA method in real reverberant conditions. It achieves better separation performance at the frame level and higher accuracy in assigning separated frames to individual speakers. The system also generalizes successfully to an unseen speech corpus and performs similarly to a talker-dependent system.
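The ordering of the two stages described in the abstract (separate the reverberant mixture first, then dereverberate each separated stream) can be illustrated with a minimal sketch. It shows only how the stages compose. The function names `two_stage_separation`, `toy_separate`, and `toy_dereverberate` are hypothetical, and the two toy stages are trivial placeholders, not the deep CASA networks used in the paper.

```python
from typing import Callable, List

Signal = List[float]

def two_stage_separation(
    mixture: Signal,
    separate: Callable[[Signal], List[Signal]],
    dereverberate: Callable[[Signal], Signal],
) -> List[Signal]:
    """Compose the two stages: separation first, then dereverberation."""
    # Stage 1: split the reverberant mixture into one reverberant
    # estimate per speaker.
    reverberant_estimates = separate(mixture)
    # Stage 2: dereverberate each separated estimate independently.
    return [dereverberate(estimate) for estimate in reverberant_estimates]

def toy_separate(mixture: Signal) -> List[Signal]:
    # Placeholder (assumption): gives each of two "speakers" half of the
    # mixture. A real system would use a trained separation model here.
    half = [0.5 * sample for sample in mixture]
    return [half, list(half)]

def toy_dereverberate(signal: Signal) -> Signal:
    # Placeholder (assumption): identity mapping. A real system would use
    # a trained dereverberation model here.
    return list(signal)

if __name__ == "__main__":
    estimates = two_stage_separation([1.0, 2.0, 4.0], toy_separate, toy_dereverberate)
    print(len(estimates), estimates[0])
```

Passing the stages in as callables keeps the composition independent of how each stage is implemented, so either placeholder can be swapped for a trained model without changing the pipeline.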