Lee Ki-Sun, Lee Eunyoung, Choi Bareun, Pyun Sung-Bom
Medical Science Research Center, Ansan Hospital, Korea University College of Medicine, Ansan-si 15355, Korea.
Department of Physical Medicine and Rehabilitation, Anam Hospital, Korea University College of Medicine, Seoul 02841, Korea.
Diagnostics (Basel). 2021 Feb 13;11(2):300. doi: 10.3390/diagnostics11020300.
Video fluoroscopic swallowing study (VFSS) is considered the gold standard diagnostic tool for evaluating dysphagia. However, manually searching the long recorded video frame by frame to identify instantaneous swallowing abnormalities in VFSS images is time-consuming and labor-intensive for the clinician. This study therefore presents a deep learning-based approach, using transfer learning with a convolutional neural network (CNN), that automatically annotates pharyngeal-phase frames in untrimmed VFSS videos so that frames need not be searched manually.
To determine whether an image frame in a VFSS video belongs to the pharyngeal phase, a single-frame baseline architecture based on a deep CNN framework is used, and a transfer learning technique with fine-tuning is applied.
Among all the CNN models tested, the model fine-tuned with two blocks of VGG-16 (VGG16-FT5) achieved the highest performance in recognizing pharyngeal-phase frames: an accuracy of 93.20 (±1.25)%, sensitivity of 84.57 (±5.19)%, specificity of 94.36 (±1.21)%, AUC of 0.8947 (±0.0269), and kappa of 0.7093 (±0.0488).
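The frame-level metrics reported above follow from a 2×2 confusion matrix of pharyngeal vs. non-pharyngeal predictions. A minimal sketch of how they are computed (the counts in the usage example are illustrative, not taken from the paper):

```python
# Hedged sketch: accuracy, sensitivity, specificity, and Cohen's kappa
# from confusion-matrix counts. Pharyngeal-phase frames are the
# positive class.
def frame_metrics(tp, fn, fp, tn):
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)   # recall on pharyngeal-phase frames
    specificity = tn / (tn + fp)   # recall on non-pharyngeal frames

    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_o = accuracy
    p_yes = ((tp + fn) / total) * ((tp + fp) / total)
    p_no = ((fp + tn) / total) * ((fn + tn) / total)
    p_e = p_yes + p_no
    kappa = (p_o - p_e) / (1 - p_e)
    return accuracy, sensitivity, specificity, kappa

# Illustrative counts only: 40 true positives, 10 false negatives,
# 5 false positives, 45 true negatives.
acc, sens, spec, kappa = frame_metrics(40, 10, 5, 45)
```

Kappa is reported alongside accuracy because pharyngeal-phase frames are a minority of an untrimmed video, so chance-corrected agreement is more informative than raw accuracy under class imbalance.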
Using appropriate fine-tuning techniques together with explainable deep learning techniques such as Grad-CAM, this study shows that the proposed single-frame-baseline-architecture-based deep CNN framework can achieve high performance in the full automation of VFSS video analysis.