Tahmasebi Sina, Gajęcki Tom, Nogueira Waldo
Department of Otolaryngology, Medical University Hannover and Cluster of Excellence "Hearing4all", Hanover, Germany.
Front Neurosci. 2020 May 19;14:434. doi: 10.3389/fnins.2020.00434. eCollection 2020.
A cochlear implant (CI) is a surgically implanted electronic device that partially restores hearing to people suffering from profound hearing loss. Although CI users generally obtain very good reception of continuous speech in the absence of background noise, they face severe limitations in music perception and appreciation. The main reasons for these limitations are the channel interactions created by the broad spread of electrical fields in the cochlea and the low number of stimulating electrodes. Moreover, CIs have severe limitations when it comes to transmitting the temporal fine structure of acoustic signals, and hence these devices elicit poor pitch and timbre perception. For these reasons, several signal processing algorithms have been proposed to make music more accessible to CI users, either by reducing the complexity of music signals or by remixing them to enhance certain components, such as the lead singing voice. In this work, a deep neural network that performs real-time audio source separation to remix music for CI users is presented. The implementation is based on a multi-layer perceptron (MLP) and has been evaluated using objective instrumental measurements to ensure clean source estimation. Furthermore, experiments were conducted in 10 normal-hearing (NH) listeners and 13 CI users to investigate how the vocals-to-instruments ratio (VIR) set by the tested listeners was affected in realistic environments with and without visual information. The objective instrumental results meet the benchmarks reported in previous studies, introducing distortions that were shown not to be perceived by CI users. Moreover, the implemented model was optimized to perform real-time source separation. The experimental results show that CI users prefer vocals enhanced by 8 dB with respect to the instruments, independent of the acoustic sound scenario and visual information. In contrast, NH listeners did not prefer a VIR different from 0 dB.
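The abstract describes two concepts that a short sketch can make concrete: an MLP that separates the vocal source, and a remix stage that applies a listener-set VIR in dB. The sketch below is illustrative only; the function names, network shape, and frame-wise mask-based formulation are assumptions, not the authors' published implementation. It shows a small MLP estimating a soft vocal mask from a magnitude-spectrum frame, followed by a remix at a VIR of 8 dB, the value CI users preferred in this study.

```python
import numpy as np

def mlp_vocal_mask(mag_frame, weights, biases):
    """Estimate a soft vocal mask for one magnitude-spectrum frame
    with a small MLP (ReLU hidden layers, sigmoid output).
    Hypothetical sketch; not the paper's actual architecture."""
    h = mag_frame
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.maximum(0.0, W @ h + b)                 # ReLU hidden layer
    z = weights[-1] @ h + biases[-1]
    return 1.0 / (1.0 + np.exp(-z))                    # mask values in [0, 1]

def remix_at_vir(vocals, accompaniment, vir_db):
    """Remix separated signals at a vocals-to-instruments ratio in dB."""
    gain = 10.0 ** (vir_db / 20.0)                     # dB -> amplitude gain
    mix = gain * vocals + accompaniment
    peak = np.max(np.abs(mix))
    return mix / peak if peak > 1.0 else mix           # avoid clipping

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_bins, n_hidden = 257, 128                        # e.g. a 512-point STFT
    weights = [rng.standard_normal((n_hidden, n_bins)) * 0.01,
               rng.standard_normal((n_bins, n_hidden)) * 0.01]
    biases = [np.zeros(n_hidden), np.zeros(n_bins)]
    frame = np.abs(rng.standard_normal(n_bins))        # stand-in magnitude frame
    mask = mlp_vocal_mask(frame, weights, biases)
    vocals_mag = mask * frame                          # masked vocal estimate
    accomp_mag = (1.0 - mask) * frame                  # residual accompaniment
    print(remix_at_vir(vocals_mag, accomp_mag, vir_db=8.0)[:5])
```

Operating frame by frame, as here, is what makes a real-time variant plausible: each hop of the analysis window can be masked and remixed with bounded latency, independent of the rest of the signal.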