Kim Yeo E, Dobko Maria, Li Haomiao, Shao Tianlan, Periyakoil Preethi, Tipton Courtney, Colasacco Christine, Serpedin Aisha, Elemento Olivier, Sabuncu Mert, Pitman Michael, Sulica Lucian, Rameau Anaïs
Sean Parker Institute for the Voice, Department of Otolaryngology-Head and Neck Surgery, Weill Cornell Medicine, New York, New York, U.S.A.
Cornell Tech, New York, New York, U.S.A.
Laryngoscope. 2025 Jul;135(7):2428-2436. doi: 10.1002/lary.32036. Epub 2025 Feb 5.
Objective: To develop and validate a deep-learning classifier, trained on voice data extracted from videolaryngostroboscopy recordings, to differentiate between three vocal fold (VF) states: healthy (HVF), unilateral VF paralysis (UVFP), and VF lesions, including benign and malignant pathologies.
Methods: Patients with UVFP (n = 105), VF lesions (n = 63), and HVF (n = 41) were retrospectively identified. Voice samples were extracted from stroboscopic videos (Pentax Laryngeal Strobe Model 9400), including sustained /i/ phonation, pitch glide, and /i/ sniff tasks. Extracted audio files were converted into Mel-spectrograms. Voice samples were divided into training (80%), validation (10%), and test (10%) sets at the patient level. Pretrained ResNet18 models were trained to classify (1) HVF versus pathological VF (lesions and UVFP), and (2) HVF, UVFP, and VF lesions. Both classifiers were further validated on an external dataset consisting of 12 UVFP, 13 VF lesion, and 15 HVF patients. Model performance was evaluated with accuracy and F1-score.
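The patient-level split described above matters because each patient contributes several voice tasks (sustained /i/, pitch glide, /i/ sniff); splitting by sample would leak the same patient's voice into both training and test sets. The abstract does not include code, so the following is a minimal sketch of such a split in plain Python; the function name and data layout are hypothetical.

```python
import random
from collections import defaultdict

def split_by_patient(samples, seed=0, train_frac=0.8, val_frac=0.1):
    """Partition (patient_id, sample) pairs into train/val/test so that
    every sample from a given patient lands in exactly one split.
    This prevents leakage when one patient contributes multiple tasks."""
    by_patient = defaultdict(list)
    for patient_id, sample in samples:
        by_patient[patient_id].append(sample)

    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)

    n = len(patients)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    patient_splits = {
        "train": patients[:n_train],
        "val": patients[n_train:n_train + n_val],
        "test": patients[n_train + n_val:],
    }
    # Expand each patient group back into its individual voice samples.
    return {name: [s for p in pats for s in by_patient[p]]
            for name, pats in patient_splits.items()}
```

With the 80/10/10 fractions used in the study, a patient (and all of their recordings) is assigned wholly to one of the three sets.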
Results: On the hold-out test set, the binary classifier outperformed the multi-class classifier (accuracy 83% vs. 40%; F1-score 0.90 vs. 0.36). On the external dataset, the binary classifier achieved an accuracy of 63% and an F1-score of 0.48, compared with 35% and 0.25 for the multi-class classifier.
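The accuracy and F1-score figures above can be reproduced from raw labels with a few lines of arithmetic. The abstract does not say how F1 was averaged for the multi-class case; the sketch below assumes macro-averaging (the unweighted mean of per-class F1), which is a common but here unconfirmed choice.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores. A class with no true
    positives and no predictions contributes an F1 of 0."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1_scores.append(2 * precision * recall / (precision + recall)
                         if precision + recall else 0.0)
    return sum(f1_scores) / len(f1_scores)
```

For a three-class problem like HVF/UVFP/lesion, macro-F1 penalizes poor performance on any one class equally, which is why it can sit well below overall accuracy when one class is rarely predicted correctly.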
Conclusion: Deep-learning classifiers differentiating HVF, UVFP, and VF lesions were developed using voice data from stroboscopic videos. Although healthy and pathological voices were differentiated with moderate accuracy, multi-class classification lowered model performance, and both models performed poorly on an external dataset. Voice captured in stroboscopic videos may have limited diagnostic value, though further studies are needed.
Level of Evidence: 4. Laryngoscope, 135:2428-2436, 2025.