Li Yuang, Wang Yuntao, Liu Xin, Shi Yuanchun, Patel Shwetak, Shih Shao-Fu
Key Laboratory of Pervasive Computing, Ministry of Education, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China.
Department of Engineering, University of Cambridge, Cambridge CB2 1TN, UK.
Sensors (Basel). 2022 Dec 20;23(1):35. doi: 10.3390/s23010035.
Voice communication using an air-conduction microphone in noisy environments suffers from degraded speech audibility. Bone-conduction microphones (BCM) are robust against ambient noise but suffer from limited effective bandwidth due to their sensing mechanism. Although existing audio super-resolution algorithms can recover the high-frequency loss to achieve high-fidelity audio, they require considerably more computational resources than are available in low-power hearable devices. This paper proposes the first-ever real-time on-chip speech audio super-resolution system for BCM. To accomplish this, we built and compared a series of lightweight audio super-resolution deep-learning models. Among all these models, ATS-UNet was the most cost-efficient because the proposed novel Audio Temporal Shift Module (ATSM) reduces the network's dimensionality while maintaining sufficient temporal features from speech audio. We then quantized the ATS-UNet and deployed it to low-end ARM micro-controller units as a real-time embedded prototype. The evaluation results show that our system achieved real-time inference speed on Cortex-M7 and higher quality than the baseline audio super-resolution method. Finally, we conducted a user study with ten expert and ten amateur listeners to evaluate our method's effectiveness as perceived by human listeners. Both groups perceived significantly higher speech quality with our method than with either the original BCM or an air-conduction microphone combined with cutting-edge noise-reduction algorithms.
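The abstract does not describe how the ATSM works internally. As a rough illustration of the general temporal-shift idea it appears to build on, the sketch below moves a fraction of the feature channels one step along the time axis, so that a layer acting on a single frame still sees values from neighbouring frames. The function name `temporal_shift`, the `fold_div` parameter, the zero padding, and the choice to shift in both directions are assumptions made for this example and are not taken from the paper; the actual ATSM may differ.

```python
def temporal_shift(frames, fold_div=4):
    """Shift a fraction of channels along the time axis (illustrative only).

    frames: list of T frames, each a list of C channel values.
    With fold = C // fold_div:
      - channels [0, fold) take their value from the next frame,
      - channels [fold, 2 * fold) take their value from the previous frame,
      - all remaining channels are left unchanged.
    Positions with no neighbouring frame are filled with 0.0.
    """
    num_frames = len(frames)
    if num_frames == 0:
        return []
    num_channels = len(frames[0])
    fold = num_channels // fold_div

    # Copy so the input is not modified in place.
    shifted = [list(frame) for frame in frames]
    for t in range(num_frames):
        for c in range(fold):
            # Pull the value from one frame in the future.
            shifted[t][c] = frames[t + 1][c] if t + 1 < num_frames else 0.0
        for c in range(fold, 2 * fold):
            # Pull the value from one frame in the past.
            shifted[t][c] = frames[t - 1][c] if t - 1 >= 0 else 0.0
    return shifted


if __name__ == "__main__":
    # Three frames with four channels each (hypothetical values).
    example = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
    for row in temporal_shift(example, fold_div=4):
        print(row)
```

The shift itself adds no multiplications or learned parameters, which is why this family of operations is attractive on micro-controllers. Note that pulling values from the next frame requires one frame of lookahead, so a strictly causal streaming variant would shift only from the past.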