Artificial Intelligence Technology & Systems, MIT Lincoln Laboratory, Lexington, MA 02421, USA.
Neural Netw. 2021 Aug;140:136-147. doi: 10.1016/j.neunet.2021.02.020. Epub 2021 Mar 4.
Future wearable technology may provide enhanced communication in noisy environments and the ability to pick out a single talker of interest in a crowded room simply by the listener shifting their attentional focus. Such a system relies on two components: speaker separation and decoding of the listener's attention to acoustic streams in the environment. To address the former, we present a system for joint speaker separation and noise suppression, referred to as the Binaural Enhancement via Attention Masking Network (BEAMNET). The BEAMNET system is an end-to-end neural network architecture based on self-attention. Binaural input waveforms are mapped to a joint embedding space via a learned encoder, and separate multiplicative masking mechanisms are included for noise suppression and speaker separation. Pairs of output binaural waveforms are then synthesized using learned decoders, each capturing a separated speaker while maintaining spatial cues. A key contribution of BEAMNET is that the architecture contains a separation path, an enhancement path, and an autoencoder path. This paper proposes a novel loss function which trains these paths simultaneously, so that disabling the masking mechanisms during inference causes BEAMNET to reconstruct the input speech signals. This allows dynamic control of the level of suppression applied by BEAMNET via a minimum gain level, which is not possible in other state-of-the-art approaches to end-to-end speaker separation. This paper also proposes a perceptually motivated waveform distance measure. Using objective speech quality metrics, the proposed system is demonstrated to perform well at separating two equal-energy talkers, even in high levels of background noise. Subjective testing shows an improvement in speech intelligibility across a range of noise levels for signals with artificially added head-related transfer functions and background noise. Finally, when used as part of an auditory attention decoder (AAD) system using existing electroencephalogram (EEG) data, BEAMNET is found to maintain the decoding accuracy achieved with ideal speaker separation, even in severe acoustic conditions. These results suggest that this enhancement system is highly effective at supporting auditory attention decoding in realistic noise environments, and could lead to improved speech perception in a cognitively controlled hearing aid.
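The encoder-mask-decoder structure described in the abstract (a learned encoder mapping the binaural mixture to a joint embedding, multiplicative masks for the enhancement and separation paths, learned decoders producing one binaural waveform per separated speaker, and a minimum gain level that can disable masking at inference) can be illustrated with the rough sketch below. This is a minimal, assumed PyTorch sketch, not the authors' BEAMNET implementation: the self-attention blocks of the actual architecture are replaced by simple 1x1 convolutions for brevity, and all layer choices, dimensions, and names (BeamnetSketch, min_gain, and so on) are placeholders.

import torch
import torch.nn as nn

class BeamnetSketch(nn.Module):
    """Illustrative encoder / mask / decoder skeleton; not the published BEAMNET."""

    def __init__(self, n_filters=256, kernel=16, stride=8, n_speakers=2):
        super().__init__()
        # Learned encoder: 2-channel (binaural) waveform -> joint embedding.
        self.encoder = nn.Conv1d(2, n_filters, kernel, stride=stride)
        # Multiplicative masks: one for noise suppression (enhancement path),
        # one per speaker (separation path).
        self.enhance_mask = nn.Sequential(
            nn.Conv1d(n_filters, n_filters, 1), nn.Sigmoid())
        self.separate_masks = nn.Sequential(
            nn.Conv1d(n_filters, n_filters * n_speakers, 1), nn.Sigmoid())
        # Learned decoders: one binaural output waveform per separated speaker.
        self.decoders = nn.ModuleList(
            [nn.ConvTranspose1d(n_filters, 2, kernel, stride=stride)
             for _ in range(n_speakers)])
        self.n_filters, self.n_speakers = n_filters, n_speakers

    def forward(self, mixture, min_gain=0.0):
        # mixture: (batch, 2, samples) binaural waveform.
        emb = self.encoder(mixture)
        # Flooring the masks at min_gain controls the amount of suppression;
        # min_gain = 1.0 disables masking entirely, so the network acts as an
        # autoencoder and reconstructs its input.
        enh = self.enhance_mask(emb).clamp(min=min_gain)
        sep = self.separate_masks(emb).clamp(min=min_gain)
        sep = sep.view(-1, self.n_speakers, self.n_filters, emb.shape[-1])
        # Each decoder maps its masked embedding back to a binaural waveform,
        # yielding one separated speaker per output.
        return [dec(emb * enh * sep[:, k]) for k, dec in enumerate(self.decoders)]

As a usage illustration under the same assumptions, BeamnetSketch()(torch.randn(1, 2, 16000), min_gain=0.2) would return two binaural estimates whose suppression is floored at a gain of 0.2, mirroring the dynamic suppression control described above.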