Wang Shansong, Safari Mojtaba, Li Qiang, Chang Chih-Wei, Qiu Richard LJ, Roper Justin, Yu David S, Yang Xiaofeng
Department of Radiation Oncology, Winship Cancer Institute, Emory University School of Medicine.
Res Sq [Preprint]. 2025 Mar 10:rs.3.rs-6129856. doi: 10.21203/rs.3.rs-6129856/v1.
Vision foundation models (VFMs) are pre-trained on extensive image datasets to learn general-purpose representations across diverse data types. These models can subsequently be fine-tuned for specific downstream tasks, significantly boosting performance across a broad range of applications. However, most existing vision foundation models that claim broad clinical applicability are pre-trained on 3D computed tomography (CT), which benefits from the availability of extensive 3D CT databases. Significant differences between CT and magnetic resonance imaging (MRI) in imaging principles, signal characteristics, and data distribution may limit their practical performance and versatility in MRI-specific applications. Here, we propose Triad, a vision foundation model for 3D MRI. Triad adopts a widely used autoencoder architecture to learn robust representations from 131,170 3D MRI volumes and uses organ-independent imaging descriptions to constrain the semantic distribution of the visual modality. We call this pre-training dataset Triad-131K; it is currently the largest 3D MRI pre-training dataset. We evaluate Triad on three tasks, namely organ/tumor segmentation, organ/cancer classification, and medical image registration, under two settings (within-domain and out-of-domain) using 25 downstream datasets. By initializing models with Triad's pre-trained weights, nnUNet-Triad improves segmentation performance by 2.51% over nnUNet-Scratch across 17 datasets, Swin-B-Triad achieves a 4.04% improvement over Swin-B-Scratch in classification across five datasets, and SwinUNETR-Triad improves registration performance by 4.00% over SwinUNETR-Scratch across two datasets. Our study demonstrates that pre-training can improve performance when the imaging modality and anatomy of upstream and downstream tasks are consistent. This work highlights the value of large-scale pre-training for downstream tasks in 3D MRI. By open-sourcing Triad's weights, code, and data, we aim to enhance the adaptability and reliability of foundation models for 3D MRI in clinical tasks.
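The transfer recipe the abstract describes, autoencoder-style pre-training on unlabeled 3D volumes followed by initializing a downstream task model from the pre-trained encoder, can be sketched in plain PyTorch. The sketch below is a minimal illustration under assumed names (Encoder3D, triad_encoder.pt, the toy layer sizes are all hypothetical); it is not the authors' released implementation and omits the organ-independent text-description conditioning described in the paper.

```python
# Hypothetical sketch of the two stages described in the abstract:
# (1) autoencoder pre-training on unlabeled 3D MRI volumes,
# (2) initializing a downstream model from the pre-trained encoder.
import torch
import torch.nn as nn

class Encoder3D(nn.Module):
    """Toy 3D convolutional encoder standing in for Triad's backbone."""
    def __init__(self, in_ch=1, width=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(in_ch, width, 3, stride=2, padding=1), nn.GELU(),
            nn.Conv3d(width, width * 2, 3, stride=2, padding=1), nn.GELU(),
        )
    def forward(self, x):
        return self.net(x)

class Decoder3D(nn.Module):
    """Mirror decoder, used only during pre-training."""
    def __init__(self, out_ch=1, width=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose3d(width * 2, width, 4, stride=2, padding=1), nn.GELU(),
            nn.ConvTranspose3d(width, out_ch, 4, stride=2, padding=1),
        )
    def forward(self, z):
        return self.net(z)

# --- Stage 1: reconstruction-based pre-training on unlabeled volumes ---
encoder, decoder = Encoder3D(), Decoder3D()
opt = torch.optim.AdamW(
    list(encoder.parameters()) + list(decoder.parameters()), lr=1e-4
)
volume = torch.randn(2, 1, 64, 64, 64)        # stand-in for a batch of 3D MRI patches
opt.zero_grad()
recon = decoder(encoder(volume))
loss = nn.functional.mse_loss(recon, volume)  # simple reconstruction objective
loss.backward()
opt.step()
torch.save(encoder.state_dict(), "triad_encoder.pt")  # hypothetical checkpoint name

# --- Stage 2: initialize a downstream segmentation model from the encoder ---
class SegModel(nn.Module):
    def __init__(self, n_classes=3):
        super().__init__()
        self.encoder = Encoder3D()                # same backbone as in pre-training
        self.head = nn.Conv3d(64, n_classes, 1)  # task head, trained from scratch
    def forward(self, x):
        return self.head(self.encoder(x))

seg = SegModel()
state = torch.load("triad_encoder.pt")
# Copy the pre-trained backbone weights; strict=False tolerates any keys that
# do not line up, and the randomly initialized head is left untouched.
missing, unexpected = seg.encoder.load_state_dict(state, strict=False)
```

In the paper's experiments the same idea is applied at scale: nnUNet, Swin-B, and SwinUNETR backbones are initialized from Triad's pre-trained weights rather than from scratch, which is the source of the reported 2.51% to 4.04% gains.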