Laryngeal disease classification using voice data: Octave-band vs. mel-frequency filters.

Author information

Song Jaemin, Kim Hyunbum, Lee Yong Oh

Affiliations

Department of Industrial and Data Engineering, Hongik University, Seoul, South Korea.

Department of Otolaryngology-Head and Neck Surgery, The Catholic University of Korea, Seoul, South Korea.

Publication information

Heliyon. 2024 Nov 30;10(24):e40748. doi: 10.1016/j.heliyon.2024.e40748. eCollection 2024 Dec 30.

DOI:10.1016/j.heliyon.2024.e40748

PMID:39720068

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11667598/

Abstract

INTRODUCTION

Laryngeal cancer diagnosis relies on specialist examinations, but non-invasive methods using voice data are emerging with artificial intelligence (AI) advancements. Mel Frequency Cepstral Coefficients (MFCCs) are widely used for voice analysis, but Octave Frequency Spectrum Energy (OFSE) may offer better accuracy in detecting subtle voice changes.

PROBLEM STATEMENT

Accurate early diagnosis of laryngeal cancer from voice data remains challenging with current feature-extraction methods such as MFCC.

OBJECTIVES

This study compares the effectiveness of MFCC and OFSE in classifying voice data into healthy, laryngeal cancer, benign mucosal disease, and vocal fold paralysis categories.

METHODS

Voice samples from 363 patients were analyzed using convolutional neural network (CNN) models with two types of input features: MFCC, and OFSE computed with 1/3 octave band filters. Gradient-weighted Class Activation Mapping (Grad-CAM) was used to visualize key voice features.
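To make the idea of 1/3 octave band energy concrete, the following is a minimal sketch, not the authors' implementation. It assumes base-2 band spacing around a 1 kHz reference, a 100 Hz to 5 kHz analysis range, a Hann window, and simple summation of FFT power inside each band; the abstract does not state the study's filter design, frequency range, or framing, and the function names `third_octave_bands` and `band_energies` are hypothetical.

```python
import numpy as np


def third_octave_bands(f_min=100.0, f_max=5000.0, f_ref=1000.0):
    """Return (lower, center, upper) edges of 1/3 octave bands.

    Assumption: base-2 spacing, center_n = f_ref * 2**(n / 3), keeping only
    bands whose center frequency lies within [f_min, f_max].
    """
    n_lo = int(np.ceil(3 * np.log2(f_min / f_ref)))
    n_hi = int(np.floor(3 * np.log2(f_max / f_ref)))
    n = np.arange(n_lo, n_hi + 1)
    center = f_ref * 2.0 ** (n / 3.0)
    # Each band spans one sixth of an octave on either side of its center.
    lower = center * 2.0 ** (-1.0 / 6.0)
    upper = center * 2.0 ** (1.0 / 6.0)
    return lower, center, upper


def band_energies(signal, sample_rate, lower, upper):
    """Sum the FFT power of one frame inside each [lower, upper) band."""
    windowed = signal * np.hanning(len(signal))
    power = np.abs(np.fft.rfft(windowed)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    return np.array(
        [power[(freqs >= lo) & (freqs < hi)].sum() for lo, hi in zip(lower, upper)]
    )


# Illustrative input: a synthetic 1 kHz tone, one second long at 16 kHz.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 1000.0 * t)

lower, center, upper = third_octave_bands()
energies = band_energies(tone, sample_rate, lower, upper)
# The band centered on the tone frequency should carry the most energy.
print(center[np.argmax(energies)])
```

Applying `band_energies` to successive short frames of a recording would give a band-by-time matrix that a CNN could take as input, in the same way an MFCC matrix would be used. Unlike mel filters, which are spaced approximately linearly at low frequencies, these bands keep a constant ratio between neighboring center frequencies across the whole range.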

RESULTS

OFSE with 1/3 octave band filters outperformed MFCC in classification accuracy, especially in multi-class classification including the laryngeal cancer, benign mucosal disease, and vocal fold paralysis groups (0.9398 ± 0.0232 for OFSE vs. 0.7061 ± 0.0561 for MFCC). Grad-CAM analysis revealed that OFSE with 1/3 octave band filters effectively distinguished laryngeal cancer from healthy voices by focusing on increased noise in the over-formant area and changes in the fundamental frequency. The analysis also highlighted that specific narrow frequency areas were critical for classification, particularly in vocal fold paralysis. Benign mucosal diseases occasionally resembled healthy voices, which makes AI differentiation between benign conditions and laryngeal cancer a significant challenge.

CONCLUSION

OFSE with 1/3 octave band filters provides superior accuracy in diagnosing laryngeal diseases including laryngeal cancer, showing potential for non-invasive, AI-driven early detection.
