Department of Electrical Engineering, Yuan Ze University, Taoyuan 320, Taiwan.
Department of Otolaryngology Head and Neck Surgery, Far Eastern Memorial Hospital, New Taipei City 220, Taiwan.
Sensors (Basel). 2022 Sep 2;22(17):6634. doi: 10.3390/s22176634.
Deep learning techniques such as convolutional neural networks (CNNs) have been successfully applied to identify pathological voices. However, a major disadvantage of these models is their lack of interpretability: they offer little explanation for their predicted outcomes. This drawback hinders the adoption of voice-disorder classification and detection systems, especially during the current pandemic. In this paper, we propose replacing the first layer of a commonly used CNN with a series of learnable sinc functions to develop an explainable SincNet system for classifying or detecting pathological voices. The sinc filters act as the front-end signal processor in SincNet; they make the first layer meaningful and directly extract the acoustic features from which the subsequent network layers generate high-level voice information. We tested the system on three different Far Eastern Memorial Hospital voice datasets. In our evaluations, the proposed approach improved accuracy by up to 7% and sensitivity by up to 9% over conventional methods, demonstrating superior performance in classifying input pathological waveforms. More importantly, based on these results, we offer possible explanations relating the system output to the speech features extracted by the first layer.
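
To make the idea of a learnable sinc front end concrete, the following is a minimal sketch of how one band-pass kernel can be built from just two cutoff frequencies. It assumes the band-pass parameterization usually associated with SincNet-style layers (the difference of two windowed sinc low-pass filters); the abstract does not give the authors' exact formulation. The function names, the 300 to 3000 Hz band, the 16 kHz sampling rate, and the kernel length are illustrative assumptions, not values from the paper. In a trained network the two cutoffs would be the learned parameters; here they are fixed constants.

```python
import math


def sinc(x):
    """Unnormalized sinc, sin(x)/x, with its limit value 1 at x = 0."""
    return 1.0 if x == 0.0 else math.sin(x) / x


def sinc_bandpass(f_low_hz, f_high_hz, sample_rate_hz, kernel_size=101):
    """Build a Hamming-windowed band-pass FIR kernel from two cutoffs.

    Only f_low_hz and f_high_hz shape the filter, which is what makes
    each first-layer filter directly readable as a frequency band.
    """
    if kernel_size % 2 == 0:
        raise ValueError("kernel_size must be odd so the kernel is symmetric")
    f1 = f_low_hz / sample_rate_hz   # normalized low cutoff (cycles/sample)
    f2 = f_high_hz / sample_rate_hz  # normalized high cutoff (cycles/sample)
    half = kernel_size // 2
    kernel = []
    for i in range(kernel_size):
        n = i - half  # centered time index
        # Ideal band-pass = low-pass at f2 minus low-pass at f1.
        ideal = (2 * f2 * sinc(2 * math.pi * f2 * n)
                 - 2 * f1 * sinc(2 * math.pi * f1 * n))
        # Hamming window reduces ripple caused by truncating the sinc.
        window = 0.54 - 0.46 * math.cos(2 * math.pi * i / (kernel_size - 1))
        kernel.append(ideal * window)
    return kernel


def gain_at(kernel, freq_hz, sample_rate_hz):
    """Magnitude of the kernel's frequency response at freq_hz."""
    w = 2 * math.pi * freq_hz / sample_rate_hz
    re = sum(h * math.cos(w * k) for k, h in enumerate(kernel))
    im = -sum(h * math.sin(w * k) for k, h in enumerate(kernel))
    return math.hypot(re, im)


if __name__ == "__main__":
    k = sinc_bandpass(300.0, 3000.0, 16000.0)
    print("gain at 1500 Hz:", gain_at(k, 1500.0, 16000.0))
    print("gain at 6000 Hz:", gain_at(k, 6000.0, 16000.0))
```

Because each filter is fully described by its two cutoffs, inspecting the learned bands after training is one plausible route to the kind of first-layer explanation the abstract refers to.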