Nogueira-Rodríguez Alba, Reboiro-Jato Miguel, Glez-Peña Daniel, López-Fernández Hugo
CINBIO, Department of Computer Science, ESEI-Escuela Superior de Ingeniería Informática, Universidade de Vigo, 32004 Ourense, Spain.
SING Research Group, Galicia Sur Health Research Institute (IIS Galicia Sur), SERGAS-UVIGO, 36213 Vigo, Spain.
Diagnostics (Basel). 2022 Apr 4;12(4):898. doi: 10.3390/diagnostics12040898.
Colorectal cancer is one of the most frequent malignancies. Colonoscopy is the de facto standard for detecting precancerous lesions in the colon, i.e., polyps, during screening studies or after facultative recommendation. In recent years, artificial intelligence, and especially deep learning techniques such as convolutional neural networks, has been applied to polyp detection and localization in order to develop real-time computer-aided detection (CADe) systems. However, the performance of machine learning models is very sensitive to changes in the nature of the testing instances, especially when trying to reproduce results on datasets entirely different from those used for model development, i.e., inter-dataset testing. Here, we report the results of testing our previously published polyp detection model on ten public colonoscopy image datasets and analyze them in the context of the results of 20 other state-of-the-art publications that use the same datasets. The F1-score of our recently published model was 0.88 when evaluated on a private test partition, i.e., intra-dataset testing, but it decayed, on average, by 13.65% when tested on the ten public datasets. In the published research, the average intra-dataset F1-score is 0.91, and we observed that it also decays in the inter-dataset setting, to an average F1-score of 0.83.
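To make the reported figures concrete, the following minimal sketch shows how an F1-score is obtained from detection counts and how a decay between the intra- and inter-dataset settings could be expressed. The function names and the detection counts are hypothetical and do not come from the study. The abstract does not state whether the 13.65% decay is relative or absolute, so the sketch assumes a relative percentage and applies it only to the literature averages (0.91 and 0.83) quoted above.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when all counts are zero."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def relative_decay(intra_f1: float, inter_f1: float) -> float:
    """Percentage drop of the inter-dataset F1 relative to the intra-dataset F1.

    Assumption: decay is measured relative to the intra-dataset score.
    """
    return 100.0 * (intra_f1 - inter_f1) / intra_f1


# Hypothetical detection counts, for illustration only (not from the study).
example_f1 = f1_score(tp=80, fp=10, fn=12)

# Literature averages quoted in the abstract: 0.91 (intra) and 0.83 (inter).
literature_decay = relative_decay(0.91, 0.83)

print(f"Example F1: {example_f1:.4f}")
print(f"Relative decay of the literature averages: {literature_decay:.2f}%")
```

Under the relative-decay assumption, the literature averages correspond to a drop of roughly 8.8%. Note that an average of per-dataset decays, as reported for the authors' model, need not equal the decay computed from averaged scores.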