Department of Signal Processing and Acoustics, Aalto University, Finland.
J Voice. 2024 Sep;38(5):975-982. doi: 10.1016/j.jvoice.2022.03.021. Epub 2022 Apr 27.
Automatic voice pathology detection is a research topic that has recently attracted increasing interest. Although methods based on deep learning are becoming popular, classical pipeline systems based on a two-stage architecture, consisting of a feature extraction stage and a classifier stage, are still widely used. In these classical detection systems, frame-wise computation of mel-frequency cepstral coefficients (MFCCs) is the most popular feature extraction method. However, no systematic study has investigated the effect of the MFCC frame length on automatic voice pathology detection. In this work, we studied the effect of the MFCC frame length on voice pathology detection using three disorders (hyperkinetic dysphonia, hypokinetic dysphonia and reflux laryngitis) from the Saarbrücken Voice Disorders (SVD) database. Detection performance was compared between speaker-dependent and speaker-independent scenarios, as well as between speaking-task-dependent and speaking-task-independent scenarios. The support vector machine (SVM), the most widely used classifier in this research area, was used as the classifier. The results show that detection accuracy depended on the MFCC frame length in all the scenarios studied. The best detection accuracy was obtained with an MFCC frame length of 500 ms and a frame shift of 5 ms.
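To make the frame-length parameter concrete, the sketch below converts a frame length and shift in milliseconds into samples and splits a signal into overlapping frames, on each of which MFCCs would then be computed. It assumes a 50 kHz sampling rate for the recordings and no padding at the signal edges; the helper names (`frame_params`, `num_frames`, `frame_signal`) are hypothetical illustrations, not the authors' implementation.

```python
from typing import List, Sequence, Tuple


def frame_params(sample_rate_hz: int, frame_ms: float, shift_ms: float) -> Tuple[int, int]:
    """Convert frame length and frame shift from milliseconds to samples."""
    frame_len = int(round(sample_rate_hz * frame_ms / 1000.0))
    hop = int(round(sample_rate_hz * shift_ms / 1000.0))
    return frame_len, hop


def num_frames(num_samples: int, frame_len: int, hop: int) -> int:
    """Count the full frames that fit in the signal (no edge padding)."""
    if num_samples < frame_len:
        return 0
    return 1 + (num_samples - frame_len) // hop


def frame_signal(signal: Sequence[float], frame_len: int, hop: int) -> List[Sequence[float]]:
    """Split a 1-D signal into overlapping frames of frame_len samples."""
    n = num_frames(len(signal), frame_len, hop)
    return [signal[i * hop : i * hop + frame_len] for i in range(n)]


if __name__ == "__main__":
    fs = 50_000  # assumption: sampling rate in Hz; one second of audio = fs samples
    for frame_ms in (25, 500):  # a conventional short frame vs. the 500 ms frame
        frame_len, hop = frame_params(fs, frame_ms, 5)
        print(frame_ms, frame_len, hop, num_frames(fs, frame_len, hop))
    # prints "25 1250 250 196" and then "500 25000 250 101"
```

With a 5 ms shift, a 500 ms frame overlaps its neighbour by 99%, so each frame spans far more glottal cycles than a conventional 20 to 40 ms speech frame while the frame rate stays at 200 frames per second.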