Song Jaemin, Kim Hyunbum, Lee Yong Oh
Department of Industrial and Data Engineering, Hongik University, Seoul, South Korea.
Department of Otolaryngology-Head and Neck Surgery, The Catholic University of Korea, Seoul, South Korea.
Heliyon. 2024 Nov 30;10(24):e40748. doi: 10.1016/j.heliyon.2024.e40748. eCollection 2024 Dec 30.
Laryngeal cancer diagnosis relies on specialist examinations, but non-invasive methods using voice data are emerging with artificial intelligence (AI) advancements. Mel Frequency Cepstral Coefficients (MFCCs) are widely used for voice analysis, but Octave Frequency Spectrum Energy (OFSE) may offer better accuracy in detecting subtle voice changes.
Accurate early diagnosis of laryngeal cancer from voice data remains challenging with current methods such as MFCC.
This study compares the effectiveness of MFCC and OFSE in classifying voice data into healthy, laryngeal cancer, benign mucosal disease, and vocal fold paralysis categories.
Voice samples from 363 patients were analyzed using convolutional neural network (CNN) models, employing MFCC and OFSE with 1/3 octave band filters as features. Gradient-weighted Class Activation Mapping (Grad-CAM) was used to visualize key voice features.
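To make the idea of octave-band energy concrete, the following is a minimal sketch of a one-third-octave band energy computation. It is not the authors' pipeline: the abstract does not describe the filter implementation, so this sketch assumes base-2 band spacing around a 1 kHz reference and sums FFT power per band instead of applying time-domain band filters. The function names, the 50 Hz to 8 kHz range, and the test tone are all hypothetical.

```python
import numpy as np

def third_octave_centers(f_min=50.0, f_max=8000.0):
    """Base-2 one-third-octave center frequencies in [f_min, f_max] (Hz)."""
    # Centers are spaced by a factor of 2**(1/3) around the 1 kHz reference.
    n_lo = int(np.ceil(3 * np.log2(f_min / 1000.0)))
    n_hi = int(np.floor(3 * np.log2(f_max / 1000.0)))
    return 1000.0 * 2.0 ** (np.arange(n_lo, n_hi + 1) / 3.0)

def third_octave_energy(signal, sample_rate, f_min=50.0, f_max=8000.0):
    """Sum FFT power inside each one-third-octave band of a 1-D signal."""
    spectrum = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    centers = third_octave_centers(f_min, f_max)
    # Band edges sit one sixth of an octave below and above each center.
    lower = centers * 2.0 ** (-1.0 / 6.0)
    upper = centers * 2.0 ** (1.0 / 6.0)
    energies = np.array([
        spectrum[(freqs >= lo) & (freqs < hi)].sum()
        for lo, hi in zip(lower, upper)
    ])
    return centers, energies

# Hypothetical input: one second of a pure 1 kHz tone sampled at 16 kHz.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 1000.0 * t)
centers, energies = third_octave_energy(tone, sample_rate)
```

In a setup like the one described, such band energies would typically be computed per short frame and stacked over time to form a 2-D input for a CNN; that framing step is omitted here.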
OFSE with 1/3 octave band filters outperformed MFCC in classification accuracy, especially in multi-class classification including the laryngeal cancer, benign mucosal disease, and vocal fold paralysis groups (0.9398 ± 0.0232 for OFSE vs. 0.7061 ± 0.0561 for MFCC). Grad-CAM analysis revealed that OFSE with 1/3 octave band filters effectively distinguished laryngeal cancer from healthy voices by focusing on increased noise in the over-formant area and changes in the fundamental frequency. The analysis also highlighted that specific narrow frequency areas were critical for classification, particularly in vocal fold paralysis. Benign mucosal diseases occasionally resembled healthy voices, which makes AI differentiation between benign conditions and laryngeal cancer a significant challenge.
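For readers unfamiliar with how Grad-CAM produces such frequency-area highlights, the sketch below shows the core computation on plain arrays. It assumes the feature maps of the last convolutional layer and the gradients of the target class score with respect to them have already been extracted from a trained model; the function name and array shapes are hypothetical, and the study's actual network and layer choice are not given in the abstract.

```python
import numpy as np

def grad_cam(feature_maps, gradients):
    """Return a Grad-CAM heatmap from arrays of shape (channels, height, width)."""
    # Channel weights: global-average-pool the gradients over the spatial axes.
    weights = gradients.mean(axis=(1, 2))
    # Weighted sum of the feature maps, then ReLU to keep positive evidence only.
    cam = np.maximum(np.tensordot(weights, feature_maps, axes=(0, 0)), 0.0)
    # Normalize to [0, 1] for display; an all-zero map is returned unchanged.
    peak = cam.max()
    return cam / peak if peak > 0 else cam
```

When the input is a time-frequency representation, one axis of the heatmap corresponds to frequency bands, which is what allows highlighted regions to be read as frequency areas.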
OFSE with 1/3 octave band filters provides superior accuracy in diagnosing laryngeal diseases including laryngeal cancer, showing potential for non-invasive, AI-driven early detection.