Schraut Tobias, Döllinger Michael, Kunduk Melda, Echternach Matthias, Dürr Stephan, Werz Julia, Schützenberger Anne
Division of Phoniatrics and Pediatric Audiology at the Department of Otorhinolaryngology, Head and Neck Surgery, University Hospital Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg, 91054 Erlangen, Germany.
J Voice. 2025 Jan 3. doi: 10.1016/j.jvoice.2024.12.008.
This study investigates the use of sustained phonations recorded during high-speed videoendoscopy (HSV) for machine learning-based assessment of hoarseness severity (H). The performance of this approach is compared with conventional recordings obtained during voice therapy to evaluate key differences and limitations of HSV-derived acoustic recordings.
A database of 617 voice recordings, each 250 ms long, was gathered during HSV examination (HS). Two databases comprising 809 vowels recorded during voice therapy were used for comparison, with recording durations of 1 second (VT-1) and 250 ms (VT-2). A total of 490 features were extracted, including perturbation and noise characteristics, spectral and cepstral coefficients, and features based on modulation spectrum, nonlinear dynamic analysis, entropy, and empirical mode decomposition. Model development focused on selecting a minimal-optimal feature subset and suitable classification algorithms. Recordings were classified into two hoarseness groups defined by expert auditory-perceptual ratings, with the classifier output serving as a continuous hoarseness score ŷ. Model performance was evaluated based on classification accuracy, the correlation between predicted scores ŷ∈[0,1] and subjective ratings H∈{0,1,2,3}, and the correlation between the relative change in quantitative and subjective ratings.
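The evaluation described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the study's pipeline: the data are synthetic, the five features are random stand-ins for the selected acoustic features, the two groups are formed with a hypothetical split at H >= 2, the logistic regression is fitted with plain gradient descent, and Pearson correlation is assumed because the abstract does not name the correlation measure.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function mapping any real z into [0, 1]."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def linear(w, b, x):
    """Weighted sum of the features plus bias."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def fit_logistic(X, y, lr=0.2, epochs=200):
    """Fit logistic regression by batch gradient descent on the log-loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(linear(w, b, xi)) - yi
            gb += err
            for j in range(d):
                gw[j] += err * xi[j]
        w = [wj - lr * g / n for wj, g in zip(w, gw)]
        b -= lr * gb / n
    return w, b


def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def make_data(n, rng):
    """Synthetic stand-in: ratings H in {0,1,2,3} and five noisy features."""
    H = [rng.randrange(4) for _ in range(n)]
    X = [[0.8 * (h - 1.5) + rng.gauss(0.0, 1.0) for _ in range(5)] for h in H]
    return X, H


def run_demo(seed=0):
    rng = random.Random(seed)
    X_train, H_train = make_data(400, rng)
    X_test, H_test = make_data(100, rng)
    # Assumption: two groups split at H >= 2 (the abstract does not give the split).
    y_train = [1 if h >= 2 else 0 for h in H_train]
    y_test = [1 if h >= 2 else 0 for h in H_test]
    w, b = fit_logistic(X_train, y_train)
    # Continuous hoarseness score y_hat in [0, 1] for each test recording.
    scores = [sigmoid(linear(w, b, x)) for x in X_test]
    accuracy = sum((s >= 0.5) == bool(t) for s, t in zip(scores, y_test)) / len(y_test)
    r = pearson(scores, H_test)  # correlation between y_hat and the 4-level rating
    return accuracy, r, scores


accuracy, r, scores = run_demo()
print(f"accuracy={accuracy:.3f}, correlation(y_hat, H)={r:.3f}")
```

The point of the sketch is that a single binary classifier provides both reported quantities: thresholding ŷ gives the classification accuracy, while the unthresholded ŷ can be correlated with the four-level rating H.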
Logistic regression combined with five acoustic features achieved a classification accuracy of 0.863 (VT-1), 0.847 (VT-2), and 0.742 (HS) on the test sets. The correlation between ŷ and H was 0.797 (VT-1), 0.763 (VT-2), and 0.637 (HS). For 21 test subjects with two recordings, the model yielded a correlation of 0.592 (VT-1), 0.486 (VT-2), and 0.088 (HS) between ∆ŷ and ∆H.
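The change-based metric for subjects with two recordings can be illustrated as follows. This is a minimal sketch with hypothetical numbers, assuming that the change is the plain difference between the second and first recording and that the correlation is Pearson's; the abstract specifies neither.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def change_correlation(first, second):
    """Correlate per-subject changes in predicted score and expert rating.

    first, second: lists of (y_hat, H) tuples, one per subject, for the
    first and second recording respectively.
    """
    d_score = [s2[0] - s1[0] for s1, s2 in zip(first, second)]
    d_rating = [s2[1] - s1[1] for s1, s2 in zip(first, second)]
    return pearson(d_score, d_rating)


# Hypothetical (y_hat, H) pairs for five subjects; not data from the study.
first = [(0.82, 3), (0.35, 1), (0.60, 2), (0.15, 0), (0.70, 2)]
second = [(0.40, 1), (0.30, 1), (0.75, 3), (0.20, 0), (0.35, 1)]

r_change = change_correlation(first, second)
print(f"correlation(delta y_hat, delta H) = {r_change:.3f}")
```

A value near zero, such as the 0.088 reported for HS, means that changes in the predicted score between two recordings barely track changes in the expert rating, even when the scores themselves correlate with H.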
While acoustic signals recorded during HSV show potential for quantitative hoarseness assessment, they are less reliable than voice therapy recordings due to practical challenges associated with oral laryngeal examination. Addressing these limitations, for example, through the use of flexible nasal endoscopy, could improve the quality of HSV-derived acoustic recordings and voice assessments.