Schraut Tobias, Schützenberger Anne, Arias-Vergara Tomás, Kunduk Melda, Echternach Matthias, Dürr Stephan, Werz Julia, Döllinger Michael
Division of Phoniatrics and Pediatric Audiology at the Department of Otorhinolaryngology, Head and Neck Surgery, University Hospital Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany.
Pattern Recognition Lab, Chair of Computer Science, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany.
Front Artif Intell. 2025 Jun 5;8:1601716. doi: 10.3389/frai.2025.1601716. eCollection 2025.
Functional voice disorders are characterized by impaired voice production without primary organic changes, posing challenges for standardized assessment. Current diagnostic methods rely heavily on subjective evaluation, suffering from inter-rater variability. High-speed videoendoscopy (HSV) offers an objective alternative by capturing true intra-cycle vocal fold behavior. Integrating time-synchronized acoustic and HSV recordings could allow for an objective visual and acoustic assessment of vocal function based on a single HSV examination. This study investigates a machine learning-based approach for hoarseness severity assessment using synchronous HSV and acoustic recordings, alongside conventional voice examinations.
Three databases comprising 457 HSV recordings of the sustained vowel /i/, 634 HSV-synchronized acoustic recordings, and clinical parameters from 923 visits were analyzed. Subjects were classified into two hoarseness groups based on auditory-perceptual ratings, with predicted scores serving as continuous hoarseness severity ratings. A videoendoscopic model was developed by selecting a suitable classification algorithm and a minimal-optimal subset of glottal parameters. This model was compared against an acoustic model based on HSV-synchronized recordings and a clinical model based on parameters from other examinations. Two ensemble models were constructed by combining the HSV-based models and all models, respectively. Model performance was evaluated on a shared test set based on classification accuracy, correlation with subjective ratings, and correlation between predicted and observed changes in hoarseness severity.
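As a rough illustration of the ensemble and evaluation steps described above, the following minimal sketch averages per-recording scores from several models and compares the result with subjective ratings. The abstract does not state how the models were combined or which correlation coefficient was used, so simple score averaging and the Pearson coefficient are assumptions here, as are all function names, the 0.5 decision threshold, and the toy values.

```python
from math import sqrt


def average_ensemble(*model_scores):
    """Combine models by averaging their per-recording scores.

    Assumption: plain averaging (soft voting); the source does not
    specify the combination rule.
    """
    n = len(model_scores[0])
    if any(len(scores) != n for scores in model_scores):
        raise ValueError("all models must score the same recordings")
    return [sum(vals) / len(vals) for vals in zip(*model_scores)]


def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def accuracy(scores, labels, threshold=0.5):
    """Share of recordings whose thresholded score matches the group label."""
    predictions = [int(s >= threshold) for s in scores]
    return sum(p == t for p, t in zip(predictions, labels)) / len(labels)


# Hypothetical scores for four test recordings (not data from the study).
videoendoscopic = [0.2, 0.7, 0.4, 0.9]
acoustic = [0.1, 0.5, 0.6, 0.8]
clinical = [0.3, 0.6, 0.2, 0.7]
subjective_ratings = [0, 2, 1, 3]  # hypothetical auditory-perceptual ratings
group_labels = [0, 1, 0, 1]        # hypothetical two-group assignment

hsv_ensemble = average_ensemble(videoendoscopic, acoustic)
full_ensemble = average_ensemble(videoendoscopic, acoustic, clinical)

for name, scores in [("HSV ensemble", hsv_ensemble), ("full ensemble", full_ensemble)]:
    print(name, pearson(scores, subjective_ratings), accuracy(scores, group_labels))
```

The same `pearson` helper could be applied to differences between two visits of the same subject to relate predicted and observed changes in hoarseness severity, again under the assumptions stated above.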
The videoendoscopic, acoustic, and clinical models achieved correlations of 0.464, 0.512, and 0.638 with subjective hoarseness ratings, respectively. Integrating glottal and acoustic parameters into the HSV-based ensemble model improved the correlation to 0.603, confirming the complementary nature of time-synchronized HSV and acoustic recordings. The ensemble model incorporating all modalities achieved the highest correlation of 0.752, underscoring the diagnostic value of multimodal objective assessments.
This study highlights the potential of synchronous HSV and acoustic recordings for objective hoarseness severity assessment, offering a more comprehensive evaluation of vocal function. While practical challenges remain, the integration of these modalities led to notable improvements, supporting their complementary value in enhancing diagnostic accuracy. Future advancements could include flexible nasal endoscopy to enable more natural phonation and refinement of glottal parameter extraction to improve model robustness under variable recording conditions.