Impact of Mask Type as Training Target for Speech Intelligibility and Quality in Cochlear-Implant Noise Reduction.

Author affiliations

Department of Computing and Electronic Engineering, Atlantic Technological University, Ash Lane, F91YW50 Sligo, Ireland.

Electrical and Electronic Engineering, University of Galway, University Road, H91TK33 Galway, Ireland.

Publication information

Sensors (Basel). 2024 Oct 14;24(20):6614. doi: 10.3390/s24206614.

DOI:10.3390/s24206614

PMID:39460094

Original article link: https://pmc.ncbi.nlm.nih.gov/articles/PMC11511210/

Abstract

The choice of training target is an important consideration when training deep neural networks for speech enhancement. Different masks have been shown to perform differently depending on the application and the conditions. This paper presents a comprehensive comparison of several masks for noise reduction in cochlear implants. The study incorporated three well-known masks, namely the Ideal Binary Mask (IBM), the Ideal Ratio Mask (IRM), and the Fast Fourier Transform Mask (FFTM), as well as two newly proposed masks based on existing masks: the Quantized Mask (QM) and the Phase-Sensitive plus Ideal Ratio Mask (PSM+). Each of the five masks was used as the target for training a network to estimate masks that separate speech from noisy mixtures. A vocoder was used to simulate the behavior of a cochlear implant. Short-time Objective Intelligibility (STOI) and Perceptual Evaluation of Speech Quality (PESQ) scores indicate that the two new masks proposed in this study (QM and PSM+) perform best for the intelligibility and quality of normal speech in the presence of stationary and non-stationary noise over a range of signal-to-noise ratios (SNRs). The Normalized Covariance Measure (NCM) and similarity scores indicate that the two new masks also perform best on the intelligibility and similarity of vocoded speech. The QM performs better than the IBM because of its finer resolution, as it approximates the Wiener Gain Function. The PSM+ performs better than the three existing benchmark masks (IBM, IRM, and FFTM) because it incorporates both magnitude and phase information.
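To make the mask terminology concrete, the following is a minimal sketch of how such training targets are commonly computed from the speech and noise magnitudes of each time-frequency unit. The IBM and IRM follow their widely used textbook definitions; the abstract does not give the paper's exact formulas, thresholds, or exponents, so the defaults here (`lc_db=0.0`, `beta=0.5`) are assumptions. The function `quantized_gain_mask` is a hypothetical illustration of a mask that rounds the Wiener gain to a few discrete levels. It is not the authors' Quantized Mask, whose definition is not stated in the abstract.

```python
import numpy as np

EPS = 1e-12  # guards against division by zero and log(0)


def wiener_gain(speech_mag, noise_mag):
    """Wiener gain per time-frequency unit: S^2 / (S^2 + N^2)."""
    s2 = speech_mag ** 2
    n2 = noise_mag ** 2
    return s2 / (s2 + n2 + EPS)


def ideal_binary_mask(speech_mag, noise_mag, lc_db=0.0):
    """IBM: 1 where the local SNR exceeds lc_db, else 0.

    lc_db (the local criterion) is an assumed default, not a value from the paper.
    """
    local_snr_db = 20.0 * np.log10((speech_mag + EPS) / (noise_mag + EPS))
    return (local_snr_db > lc_db).astype(np.float64)


def ideal_ratio_mask(speech_mag, noise_mag, beta=0.5):
    """IRM: Wiener gain raised to beta (beta=0.5 is a common choice, assumed here)."""
    return wiener_gain(speech_mag, noise_mag) ** beta


def quantized_gain_mask(speech_mag, noise_mag, levels=4):
    """Hypothetical illustration only: Wiener gain rounded to `levels`
    evenly spaced values in [0, 1]. Not the paper's Quantized Mask."""
    gain = wiener_gain(speech_mag, noise_mag)
    return np.round(gain * (levels - 1)) / (levels - 1)


# Toy 2x2 "spectrogram" magnitudes (made-up numbers for illustration).
speech = np.array([[2.0, 0.5], [3.0, 0.1]])
noise = np.ones_like(speech)

ibm = ideal_binary_mask(speech, noise)
irm = ideal_ratio_mask(speech, noise)
qgm = quantized_gain_mask(speech, noise)

# At training time the mask is the network's target; at inference the
# estimated mask multiplies the noisy magnitude spectrogram element-wise.
```

With two levels, a binary mask keeps or discards each unit outright, while a mask with more levels or a continuous ratio can attenuate units partially, which is the sense in which a finer-resolution mask can track the Wiener gain more closely.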
