Jiang Jue, Rangnekar Aneesh, Veeraraghavan Harini
Department of Medical Physics, Memorial Sloan Kettering Cancer Center, New York, New York, USA.
Med Phys. 2025 Mar;52(3):1573-1588. doi: 10.1002/mp.17541. Epub 2024 Dec 5.
Self-supervised learning (SSL) is an approach for extracting useful feature representations from unlabeled data, enabling fine-tuning on downstream tasks with limited labeled examples. Self-pretraining is an SSL approach that uses the curated downstream task dataset for both pretraining and fine-tuning. The availability of large, diverse, and uncurated public medical image sets presents an opportunity to apply SSL "in the wild" and potentially create foundation models that are robust to imaging variations. However, the benefit of wild-pretraining versus self-pretraining has not been studied for medical image analysis.
To compare the robustness of wild-pretrained versus self-pretrained models built from convolutional neural network (CNN) and transformer (vision transformer [ViT] and hierarchical shifted window [Swin]) architectures for non-small cell lung cancer (NSCLC) segmentation from 3D computed tomography (CT) scans.
CNN, ViT, and Swin models were wild-pretrained using 10,412 unlabeled 3D CT scans sourced from The Cancer Imaging Archive and internal datasets. Self-pretraining was applied to the same networks using a curated public downstream task dataset (n = 377) of patients with NSCLC. The pretext tasks introduced in the self-distilled masked image transformer were used for both pretraining approaches. All models were fine-tuned to segment NSCLC (n = 377 training dataset) and tested on two separate datasets containing early-stage (public, n = 156) and advanced-stage (internal, n = 196) NSCLC. Models were evaluated in terms of: (a) accuracy, (b) robustness to image differences arising from contrast, slice thickness, and reconstruction kernels, and (c) the impact of the pretext tasks used for pretraining. Feature reuse was evaluated using centered kernel alignment (CKA).
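As an illustration of the feature reuse analysis, the following is a minimal sketch of linear centered kernel alignment in Python with NumPy; the function name and the choice of the linear (rather than RBF kernel) variant are assumptions for illustration and do not represent the authors' exact implementation.

import numpy as np

def linear_cka(X, Y):
    """Linear centered kernel alignment between two feature matrices.

    X, Y: arrays of shape (n_samples, n_features_x) and (n_samples, n_features_y),
    e.g., flattened layer activations for the same set of input CT patches.
    Returns a similarity score in [0, 1]; higher values indicate greater feature reuse.
    """
    # Center each feature dimension across samples.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # Frobenius norms of the cross-covariance and self-covariance terms.
    hsic = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, ord="fro")
    norm_y = np.linalg.norm(Y.T @ Y, ord="fro")
    return hsic / (norm_x * norm_y)

# Hypothetical usage: compare a layer's activations before and after fine-tuning,
# where pretrained_feats and finetuned_feats are (n_samples, n_features) arrays.
# score = linear_cka(pretrained_feats, finetuned_feats)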
Wild-pretrained Swin models showed higher feature reuse at the earlier layers and increased feature differentiation closer to the output. Wild-pretrained Swin outperformed self-pretrained models across the analyzed imaging acquisitions. Neither ViT nor CNN showed a clear benefit of wild-pretraining over self-pretraining. The masked image prediction pretext task, which forces networks to learn local structure, resulted in higher accuracy than the contrastive task, which models global image information.
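To make the masked image prediction pretext task concrete, below is a simplified PyTorch sketch of such a loss for 3D CT volumes; the patch size, mask ratio, and L1 reconstruction objective are illustrative assumptions and do not reproduce the full self-distilled masked image transformer formulation used in the study.

import torch

def masked_image_prediction_loss(model, ct_volume, patch=16, mask_ratio=0.7):
    """Simplified masked image prediction pretext loss for a 3D CT volume.

    ct_volume: tensor of shape (B, 1, D, H, W) with D, H, W divisible by `patch`.
    A random subset of non-overlapping 3D patches is zeroed out, and the model is
    trained to reconstruct the original intensities at the masked locations, which
    encourages learning of local anatomical structure.
    """
    B, C, D, H, W = ct_volume.shape
    # Build a per-patch keep/mask decision and expand it to voxel resolution.
    grid = (D // patch, H // patch, W // patch)
    keep = torch.rand(B, *grid, device=ct_volume.device) > mask_ratio
    voxel_mask = keep.repeat_interleave(patch, 1).repeat_interleave(patch, 2) \
                     .repeat_interleave(patch, 3).unsqueeze(1).float()
    masked_input = ct_volume * voxel_mask
    # The model predicts a full-resolution volume; loss is computed only on masked voxels.
    prediction = model(masked_input)
    masked_region = 1.0 - voxel_mask
    loss = ((prediction - ct_volume).abs() * masked_region).sum() / masked_region.sum().clamp(min=1.0)
    return loss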
Wild-pretrained Swin networks were more robust than self-pretrained networks to the analyzed CT imaging differences for lung tumor segmentation. ViT and CNN models did not show a clear benefit from wild-pretraining over self-pretraining.