Eindhoven University of Technology, Groene Loper 3, 5612 AE Eindhoven, The Netherlands.
Amsterdam UMC, Location VUmc, De Boelelaan 1117, 1081 HV Amsterdam, The Netherlands.
Med Image Anal. 2024 Dec;98:103298. doi: 10.1016/j.media.2024.103298. Epub 2024 Aug 12.
Pre-training deep learning models with large datasets of natural images, such as ImageNet, has become the standard for endoscopic image analysis. This approach is generally superior to training from scratch, due to the scarcity of high-quality medical imagery and labels. However, it remains unknown whether features learned on natural imagery provide an optimal starting point for downstream medical endoscopic imaging tasks. Intuitively, pre-training with imagery closer to the target domain could yield better-suited feature representations. This study evaluates whether in-domain pre-training for gastrointestinal endoscopic image analysis offers benefits over pre-training on natural images. To this end, we present a dataset comprising 5,014,174 gastrointestinal endoscopic images from eight different medical centers (GastroNet-5M), and exploit self-supervised learning with SimCLRv2, MoCov2, and DINO to learn features relevant to in-domain downstream tasks. The learned features are compared to features learned on natural images with multiple methods and variable amounts of data and/or labels (e.g., billion-scale semi-weakly supervised learning and supervised learning on ImageNet-21k). The evaluation is performed on five downstream datasets, each designed for a specific gastrointestinal task, for example, GIANA for angiodysplasia detection and Kvasir-SEG for polyp segmentation. The findings indicate that self-supervised domain-specific pre-training, specifically with the DINO framework, results in better-performing models than any supervised pre-training on natural images. On the ResNet50 and Vision-Transformer-small architectures, self-supervised in-domain pre-training with DINO yields an average performance boost of 1.63% and 4.62%, respectively, on the downstream datasets, measured against the best performance achieved through pre-training on natural images within any of the evaluated frameworks. Moreover, the in-domain pre-trained models exhibit increased robustness against distortion perturbations (noise, contrast, blur, etc.): the in-domain pre-trained ResNet50 and Vision-Transformer-small with DINO score, on average, 1.28% and 3.55% higher on the performance metrics than the best models pre-trained on natural images. Overall, this study highlights the importance of in-domain pre-training for improving the generality, scalability, and performance of deep learning for medical image analysis. The GastroNet-5M pre-trained weights are publicly available in our repository: huggingface.co/tgwboers/GastroNet-5M_Pretrained_Weights.
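As a minimal, hedged sketch of how the released weights might be used, the Python snippet below downloads a checkpoint from the stated Hugging Face repository and loads it into a ResNet50 backbone for downstream fine-tuning. The repository id comes from the abstract; the checkpoint filename, the key layout inside the checkpoint, and the two-class downstream head are assumptions for illustration only, not the authors' exact pipeline.

    # Sketch: load GastroNet-5M pre-trained weights into a ResNet50 and
    # attach a downstream head. Filename and checkpoint layout are assumed.
    import torch
    import torchvision.models as models
    from huggingface_hub import hf_hub_download

    # Repo id taken from the abstract; the filename is hypothetical --
    # check the repository for the actual artifact names.
    ckpt_path = hf_hub_download(
        repo_id="tgwboers/GastroNet-5M_Pretrained_Weights",
        filename="resnet50_dino_gastronet5m.pth",  # hypothetical name
    )

    backbone = models.resnet50(weights=None)  # no ImageNet initialization
    state_dict = torch.load(ckpt_path, map_location="cpu")

    # DINO-style checkpoints often wrap backbone weights under prefixes
    # such as "module." or "backbone."; strip them if present (assumption
    # about the checkpoint layout).
    state_dict = {
        k.replace("module.", "").replace("backbone.", ""): v
        for k, v in state_dict.items()
    }
    # strict=False tolerates the missing classifier head and any
    # projection-head keys left over from self-supervised pre-training.
    missing, unexpected = backbone.load_state_dict(state_dict, strict=False)

    # Replace the classification head for a downstream task, e.g. a
    # binary detection task, and fine-tune end-to-end.
    backbone.fc = torch.nn.Linear(backbone.fc.in_features, 2)

The same pattern would apply to a Vision-Transformer-small backbone (e.g., created with the timm library), with the head replacement adjusted to that architecture.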