Borsky Michal, Mehta Daryush D, Van Stan Jarrad H, Gudnason Jon
IEEE/ACM Trans Audio Speech Lang Process. 2017 Dec;25(12):2281-2291. doi: 10.1109/taslp.2017.2759002. Epub 2017 Nov 27.
The goal of this study was to investigate the performance of different feature types for voice quality classification using multiple classifiers. The study compared the COVAREP feature set, comprising glottal source features, frequency-warped cepstral features, and harmonic model features, against mel-frequency cepstral coefficients (MFCCs) computed from the acoustic voice signal, the acoustic-based glottal inverse filtered (GIF) waveform, and the electroglottographic (EGG) waveform. Our hypothesis was that MFCCs can capture the perceived voice quality from any of these three voice signals. Experiments were carried out on recordings from 28 participants with normal vocal status who were prompted to sustain vowels with modal and non-modal voice qualities. Recordings were rated by an expert listener using the Consensus Auditory-Perceptual Evaluation of Voice (CAPE-V), and the ratings were transformed into a dichotomous label (presence or absence) for the prompted voice qualities of modal voice, breathiness, strain, and roughness. Classification was performed with support vector machine, random forest, deep neural network, and Gaussian mixture model classifiers, which were built as speaker independent using a leave-one-speaker-out strategy. The best classification accuracy, 79.97%, was achieved with the full COVAREP set. The harmonic model features were the best-performing subset, with 78.47% accuracy, and the static+dynamic MFCCs achieved 74.52%. A closer analysis showed that MFCC and dynamic MFCC features were able to classify the modal, breathy, and strained voice quality dimensions from the acoustic and GIF waveforms, whereas the EGG waveform exhibited reduced classification performance.
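The speaker-independent evaluation protocol described above (a leave-one-speaker-out strategy, with a support vector machine as one of the classifiers) can be sketched as follows. This is a minimal illustration, not the study's implementation: the feature matrix here is synthetic random data standing in for MFCC or COVAREP feature vectors, and the speaker counts, clip counts, and feature dimensionality are hypothetical.

```python
# Illustrative sketch of leave-one-speaker-out classification.
# X stands in for per-recording feature vectors (e.g., MFCCs); in the
# actual study these were extracted from acoustic, GIF, and EGG signals.
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_speakers, clips_per_speaker, n_feats = 6, 10, 13  # hypothetical sizes
X = rng.normal(size=(n_speakers * clips_per_speaker, n_feats))
y = rng.integers(0, 2, size=len(X))  # dichotomous label: quality present/absent
groups = np.repeat(np.arange(n_speakers), clips_per_speaker)  # speaker IDs

# Each fold holds out every recording from one speaker for testing,
# so the classifier is never evaluated on a speaker it was trained on.
accs = []
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
    clf = SVC(kernel="rbf").fit(X[train_idx], y[train_idx])
    accs.append(clf.score(X[test_idx], y[test_idx]))

print(f"mean leave-one-speaker-out accuracy: {np.mean(accs):.3f}")
```

With random features and labels the mean accuracy hovers near chance; the point of the sketch is the fold structure, in which the number of folds equals the number of speakers.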