Yuzhou Liu, DeLiang Wang
Department of Computer Science and Engineering, The Ohio State University, Columbus, OH 43210-1277 USA.
Department of Computer Science and Engineering and the Center for Cognitive and Brain Sciences, The Ohio State University, Columbus, OH 43210-1277 USA.
IEEE/ACM Trans Audio Speech Lang Process. 2019;27(12):2092-2102. doi: 10.1109/taslp.2019.2941148. Epub 2019 Sep 12.
We address talker-independent monaural speaker separation from the perspectives of deep learning and computational auditory scene analysis (CASA). Specifically, we decompose the multi-speaker separation task into the stages of simultaneous grouping and sequential grouping. Simultaneous grouping is first performed in each time frame by separating the spectra of different speakers with a permutation-invariantly trained neural network. In the second stage, the frame-level separated spectra are sequentially grouped across time and assigned to individual speakers by a clustering network. The proposed deep CASA approach optimizes frame-level separation and speaker tracking in turn, and produces excellent results on both objectives. Experimental results on the benchmark WSJ0-2mix database show that the new approach achieves state-of-the-art results with a modest model size.
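The first stage relies on frame-level permutation-invariant training: for each time frame, the network's outputs are compared against the reference spectra under every speaker permutation, and the permutation with the lowest error defines the training loss. The following is a minimal NumPy sketch of that idea for illustration only; the array shapes, the MSE criterion, and the function name are assumptions, not the paper's actual implementation.

```python
import numpy as np
from itertools import permutations

def frame_level_pit_loss(pred, target):
    """Frame-level permutation-invariant loss (illustrative sketch).

    pred, target: arrays of shape (num_speakers, num_freq_bins)
    holding the estimated and reference magnitude spectra of one
    time frame. The speaker permutation minimizing the total MSE
    is chosen per frame, mirroring the simultaneous-grouping stage;
    resolving which permutation belongs to which speaker across
    frames is left to the second (sequential-grouping) stage.
    """
    num_spk = pred.shape[0]
    best_loss, best_perm = np.inf, None
    for perm in permutations(range(num_spk)):
        loss = np.mean((pred[list(perm)] - target) ** 2)
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm

# Toy example: the network's outputs are swapped relative to the targets.
target = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 1.0]])
pred = np.array([[0.0, 0.9, 1.1],
                 [1.1, 0.1, 0.0]])
loss, perm = frame_level_pit_loss(pred, target)
# perm == (1, 0): the swapped assignment yields the smaller error
```

Because the permutation is chosen independently per frame, consecutive frames of the same speaker may end up on different output streams; the clustering network in the second stage exists precisely to reassign these frame-level outputs consistently over time.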