Dao Thao Thi Phuong, Huynh Tuan-Luc, Pham Minh-Khoi, Le Trung-Nghia, Nguyen Tan-Cong, Nguyen Quang-Thuc, Tran Bich Anh, Van Boi Ngoc, Ha Chanh Cong, Tran Minh-Triet
University of Science, Ho Chi Minh City, Vietnam.
John von Neumann Institute, Ho Chi Minh City, Vietnam.
J Imaging Inform Med. 2024 Dec;37(6):2794-2809. doi: 10.1007/s10278-024-01068-z. Epub 2024 May 29.
The diagnosis and treatment of vocal fold disorders rely heavily on laryngoscopy. A comprehensive vocal fold diagnosis requires accurate identification of crucial anatomical structures and potential lesions during laryngoscopic observation. However, existing approaches have yet to explore joint optimization of the decision-making process, i.e., performing object detection and image classification simultaneously. In this study, we provide a new dataset, VoFoCD, with 1724 laryngology images designed explicitly for object detection and image classification in laryngoscopy. Images in the VoFoCD dataset are categorized into four classes and comprise six glottic object types. Moreover, we propose a novel Multitask Efficient trAnsformer network for Laryngoscopy (MEAL) to classify vocal fold images and detect glottic landmarks and lesions. To further facilitate interpretability for clinicians, MEAL provides attention maps that visualize the important learned regions, yielding explainable artificial intelligence results to support clinical decision-making. We also analyze the model's effectiveness in simulated clinical scenarios in which camera shaking occurs during laryngoscopy. The proposed model demonstrates outstanding performance on our VoFoCD dataset: the accuracy for image classification and the mean average precision at an intersection-over-union threshold of 0.5 (mAP50) for object detection are 0.951 and 0.874, respectively. Our MEAL method integrates global knowledge, encompassing general laryngoscopy image classification, into local features, which correspond to distinct anatomical regions of the vocal fold, particularly abnormal regions including benign and malignant lesions. Our contribution can effectively aid laryngologists in visually identifying benign or malignant vocal fold lesions and classifying images during laryngeal endoscopy.
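To illustrate the joint-optimization idea described in the abstract, the following is a minimal conceptual sketch in PyTorch of a shared backbone feeding both an image-level classification head and an object-detection head, trained on a weighted sum of the two task losses. This is not the authors' MEAL implementation: the module name MultitaskLaryngoscopyNet, the layer sizes, the query-based detection formulation, and the loss weighting are all illustrative assumptions. Only the class count (4 image classes) and object-type count (6 glottic object types) follow the VoFoCD description above.

import torch
import torch.nn as nn

class MultitaskLaryngoscopyNet(nn.Module):  # hypothetical name, not the MEAL code
    def __init__(self, num_classes=4, num_object_types=6, num_queries=10, dim=256):
        super().__init__()
        # Shared convolutional backbone producing a feature map used by both tasks
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Global (image-level) head: the 4 laryngoscopy image classes in VoFoCD
        self.cls_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                      nn.Linear(dim, num_classes))
        # Local (object-level) head: per-query class logits and box coordinates
        # for the 6 glottic object types (query-based detection used here only
        # as an illustrative formulation)
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=2)
        self.det_cls = nn.Linear(dim, num_object_types + 1)  # +1 for "no object"
        self.det_box = nn.Linear(dim, 4)                     # (cx, cy, w, h)

    def forward(self, images):
        feats = self.backbone(images)                        # (B, dim, H', W')
        tokens = feats.flatten(2).transpose(1, 2)            # (B, H'*W', dim)
        image_logits = self.cls_head(feats)                  # (B, num_classes)
        q = self.queries.unsqueeze(0).expand(images.size(0), -1, -1)
        dec = self.decoder(q, tokens)                        # (B, Q, dim)
        return image_logits, self.det_cls(dec), self.det_box(dec).sigmoid()

# Joint optimization: a single backward pass over a weighted sum of both losses.
model = MultitaskLaryngoscopyNet()
images = torch.randn(2, 3, 128, 128)
image_logits, obj_logits, boxes = model(images)
cls_loss = nn.functional.cross_entropy(image_logits, torch.tensor([0, 2]))
# A real detection loss (query-to-object matching, box regression) is omitted;
# this placeholder only keeps the sketch runnable.
det_loss = boxes.abs().mean()
(cls_loss + 0.5 * det_loss).backward()

Sharing the backbone means gradients from both heads shape the same features, which is the sense in which image classification and object detection are optimized jointly rather than as separate models.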