Luo Yi, Chen Zhuo, Hershey John R, Le Roux Jonathan, Mesgarani Nima
Department of Electrical Engineering, Columbia University, New York, NY.
Mitsubishi Electric Research Laboratories (MERL), Cambridge, MA.
Proc IEEE Int Conf Acoust Speech Signal Process. 2017 Mar;2017:61-65. doi: 10.1109/ICASSP.2017.7952118. Epub 2017 Jun 19.
Deep clustering is the first method to handle general audio separation scenarios with multiple sources of the same type and an arbitrary number of sources, performing impressively in speaker-independent speech separation tasks. However, little is known about its effectiveness in other challenging situations such as music source separation. Unlike conventional networks, which directly estimate the source signals, deep clustering generates an embedding for each time-frequency bin and separates sources by clustering the bins in the embedding space. We show that deep clustering outperforms conventional networks on a singing voice separation task, in both matched and mismatched conditions, even though conventional networks have the advantage of end-to-end training for best signal approximation. This is presumably because the more flexible objective of deep clustering engenders better regularization. Since the strengths of deep clustering and conventional network architectures appear complementary, we explore combining them in a single hybrid network trained via an approach akin to multi-task learning. Remarkably, the combination significantly outperforms either of its components.
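The separation step described in the abstract (one embedding per time-frequency bin, then clustering the bins in the embedding space) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the abstract: k-means as the clustering algorithm, unit-normalized embeddings, farthest-point initialization, the function names `kmeans` and `masks_from_embeddings`, and the synthetic embeddings that stand in for the output of a trained network.

```python
import numpy as np


def kmeans(points, k, n_iter=20):
    """Plain Lloyd's k-means (an assumed choice of clustering algorithm).

    points: array of shape (N, D). Returns an integer label for each row.
    """
    # Farthest-point initialization: start from the first point, then
    # repeatedly add the point farthest from all centers chosen so far.
    centers = [points[0]]
    for _ in range(1, k):
        d2 = np.min([((points - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(d2)])
    centers = np.stack(centers).astype(np.float64)

    for _ in range(n_iter):
        # Squared distance from every point to every center: shape (N, k).
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)
        # Move each center to the mean of its members (skip empty clusters).
        for j in range(k):
            members = points[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return labels


def masks_from_embeddings(embeddings, n_sources):
    """Cluster per-bin embeddings into one binary mask per source.

    embeddings: array of shape (T, F, D), one D-dimensional vector per
    time-frequency bin. Returns an array of shape (n_sources, T, F).
    """
    T, F, D = embeddings.shape
    flat = embeddings.reshape(T * F, D)
    # Unit-normalize each embedding (assumed preprocessing).
    flat = flat / (np.linalg.norm(flat, axis=1, keepdims=True) + 1e-8)
    labels = kmeans(flat, n_sources).reshape(T, F)
    return np.stack([labels == j for j in range(n_sources)]).astype(np.float32)


# Synthetic stand-in for network output: the lower three frequency bins
# point one way in embedding space, the upper three another way.
rng = np.random.default_rng(0)
T, F, D = 4, 6, 3
emb = np.zeros((T, F, D))
emb[:, :3] = [1.0, 0.0, 0.0]
emb[:, 3:] = [0.0, 1.0, 0.0]
emb += 0.01 * rng.standard_normal(emb.shape)

masks = masks_from_embeddings(emb, n_sources=2)
```

Each mask would then be multiplied element-wise with the mixture spectrogram to obtain an estimate of one source. Because every bin is assigned to exactly one cluster, the masks sum to one at each bin.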