Gupta Siddhant, Patil Ankur T, Purohit Mirali, Parmar Mihir, Patel Maitreya, Patil Hemant A, Guido Rodrigo Capobianco
Speech Research Lab, Dhirubhai Ambani Institute of Information and Communication Technology (DA-IICT), Gandhinagar 382007, India.
Arizona State University, Tempe, USA.
Neural Netw. 2021 Jul;139:105-117. doi: 10.1016/j.neunet.2021.02.008. Epub 2021 Feb 24.
Recently, Deep Learning methodologies have gained significant attention for the severity-based classification of dysarthric speech. Detecting dysarthria and quantifying its severity are of paramount importance in various real-life applications, such as assessing patients' progress during treatment (including adequate planning of their therapy) and improving speech-based interactive systems so that they can handle pathologically affected voices automatically. Notably, current speech-powered tools often deal with short-duration speech segments and, consequently, are less efficient at handling impaired speech, even when Convolutional Neural Networks (CNNs) are used. Thus, detecting the dysarthria severity level from short speech segments might help improve the performance and applicability of those systems. To achieve this goal, we propose a novel Residual Network (ResNet)-based technique that takes short-duration speech segments as input. A statistically meaningful objective analysis of our experiments on the standard Universal Access corpus shows average improvements of 21.35% in classification accuracy and 22.48% in F1-score over the baseline CNN. For additional comparison, Gaussian Mixture Models and Light CNNs were also tested. Overall, the proposed ResNet approach achieved a classification accuracy of 98.90% and an F1-score of 98.00%, confirming its efficacy and supporting its practical applicability.
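The two ideas the abstract combines, cutting an utterance into short fixed-length segments and passing features through residual blocks with identity shortcuts, can be illustrated with a minimal NumPy sketch. This is not the authors' architecture: the sampling rate, the 0.5 s segment length, the toy log-energy features, and the use of dense layers instead of convolutions are all hypothetical simplifications chosen only to show the shortcut computation relu(x + F(x)).

```python
import numpy as np

def segment(signal: np.ndarray, seg_len: int) -> np.ndarray:
    """Split a 1-D signal into non-overlapping segments of seg_len samples.
    Any trailing samples that do not fill a whole segment are dropped."""
    n = len(signal) // seg_len
    return signal[: n * seg_len].reshape(n, seg_len)

def relu(x: np.ndarray) -> np.ndarray:
    return np.maximum(x, 0.0)

def residual_block(x, w1, b1, w2, b2):
    """Generic residual block: output = relu(x + F(x)), where F is two dense
    layers (a simplification; ResNets for speech typically use convolutions)."""
    h = relu(x @ w1 + b1)
    f = h @ w2 + b2
    return relu(x + f)  # identity shortcut adds the input back before the activation

rng = np.random.default_rng(0)
fs = 16000                               # hypothetical sampling rate (Hz)
signal = rng.standard_normal(2 * fs)     # 2 s of synthetic noise standing in for speech
segs = segment(signal, seg_len=fs // 2)  # hypothetical 0.5 s segments -> shape (4, 8000)

# Toy per-segment features (placeholder for spectral features):
# log energy of 16 equal sub-frames of 500 samples each.
feats = np.log((segs.reshape(len(segs), 16, 500) ** 2).mean(axis=2) + 1e-10)

dim = feats.shape[1]
w1, w2 = 0.1 * rng.standard_normal((dim, dim)), 0.1 * rng.standard_normal((dim, dim))
b1, b2 = np.zeros(dim), np.zeros(dim)
out = residual_block(feats, w1, b1, w2, b2)  # shape (4, 16), one vector per segment
```

Because the shortcut passes the input through unchanged, a block whose inner weights are all zero reduces to relu(x); this is what lets deep residual stacks start close to an identity mapping and remain trainable.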