Liu George S, Hodges Jordan M, Yu Jingzhi, Sung C Kwang, Erickson-DiRenzo Elizabeth, Doyle Philip C
Department of Otolaryngology-Head and Neck Surgery, Stanford University School of Medicine, Stanford University, Stanford, California, USA.
Division of Laryngology, Stanford University School of Medicine, Stanford University, Stanford, California, USA.
Laryngoscope Investig Otolaryngol. 2023 Aug 31;8(5):1312-1318. doi: 10.1002/lio2.1144. eCollection 2023 Oct.
Advances in artificial intelligence (AI) technology have increased the feasibility of classifying voice disorders using voice recordings as a screening tool. This work builds upon previous models that analyze single vowel recordings by analyzing multiple vowel recordings simultaneously to enhance prediction of vocal pathology.
Voice samples from the Saarbruecken Voice Database, including three sustained vowels (/a/, /i/, /u/) from 687 healthy human participants and 334 dysphonic patients, were used to train 1-dimensional convolutional neural network models for multiclass classification of healthy, hyperfunctional dysphonia, and laryngitis voice recordings. Three models were trained: (1) a baseline model that analyzed individual vowels in isolation, (2) a stacked vowel model that analyzed three vowels (/a/, /i/, /u/) at neutral pitch simultaneously, and (3) a stacked pitch model that analyzed the /a/ vowel at three pitches (low, neutral, and high) simultaneously.
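As an illustration (not the authors' code), the stacked input described above can be sketched as a multi-channel array for a 1-dimensional convolutional network; the fixed clip length and padding strategy here are assumptions:

```python
import numpy as np

def stack_vowels(a_wave, i_wave, u_wave, n_samples=16000):
    """Stack three sustained-vowel recordings into a (3, n_samples)
    array, truncating or zero-padding each to a fixed length, so a
    1-D CNN can convolve across time with the vowels as channels."""
    def fit(wave):
        wave = np.asarray(wave, dtype=np.float32)[:n_samples]
        return np.pad(wave, (0, n_samples - len(wave)))
    return np.stack([fit(a_wave), fit(i_wave), fit(u_wave)])

# Toy example: three synthetic "recordings" of different lengths.
x = stack_vowels(np.random.randn(12000),
                 np.random.randn(16000),
                 np.random.randn(20000))
print(x.shape)  # (3, 16000)
```

The same channel-stacking idea applies to the stacked pitch model, with the low, neutral, and high /a/ recordings taking the place of the three vowels.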
For multiclass classification of healthy, hyperfunctional dysphonia, and laryngitis voice recordings, the stacked vowel model demonstrated higher performance compared with the baseline and stacked pitch models (F1 score 0.81 vs. 0.77 and 0.78, respectively). Specifically, the stacked vowel model achieved higher performance for class-specific classification of hyperfunctional dysphonia voice samples compared with the baseline and stacked pitch models (F1 score 0.56 vs. 0.49 and 0.50, respectively).
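For reference, the F1 scores reported above combine per-class precision and recall; a minimal sketch of per-class and macro-averaged F1 with hypothetical counts (not the study's data):

```python
def f1(tp, fp, fn):
    """F1 = harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical (tp, fp, fn) counts for a 3-class problem:
# healthy, hyperfunctional dysphonia, laryngitis.
classes = {"healthy": (80, 10, 12),
           "hyperfunctional dysphonia": (25, 20, 19),
           "laryngitis": (18, 8, 10)}
per_class = {name: f1(*counts) for name, counts in classes.items()}
macro_f1 = sum(per_class.values()) / len(per_class)
print(round(macro_f1, 2))  # 0.7
```

The class-specific F1 of 0.56 for hyperfunctional dysphonia versus the overall 0.81 shows how a harder class can lag the aggregate score.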
This study demonstrates the feasibility and potential of analyzing multiple sustained vowel recordings simultaneously to improve AI-driven screening and classification of vocal pathology. The stacked vowel model architecture in particular offers promise to enhance such an approach.
AI analysis of multiple vowel recordings can improve classification of voice pathologies compared with models using a single sustained vowel and offer a strategy to enhance AI-driven screening of voice disorders.
Level of evidence: 3.