Xiang Jinxi, Wang Xiyue, Zhang Xiaoming, Xi Yinghua, Eweje Feyisope, Chen Yijiang, Li Yuchen, Bergstrom Colin, Gopaulchan Matthew, Kim Ted, Yu Kun-Hsing, Willens Sierra, Olguin Francesca Maria, Nirschl Jeffrey J, Neal Joel, Diehn Maximilian, Yang Sen, Li Ruijiang
Department of Radiation Oncology, Stanford University School of Medicine, Stanford, CA, USA.
Department of Pathology, Stanford University School of Medicine, Stanford, CA, USA.
Nature. 2025 Feb;638(8051):769-778. doi: 10.1038/s41586-024-08378-w. Epub 2025 Jan 8.
Clinical decision-making is driven by multimodal data, including clinical notes and pathological characteristics. Artificial intelligence approaches that can effectively integrate multimodal data hold significant promise in advancing clinical care. However, the scarcity of well-annotated multimodal datasets in clinical settings has hindered the development of useful models. In this study, we developed the Multimodal transformer with Unified maSKed modeling (MUSK), a vision-language foundation model designed to leverage large-scale, unlabelled, unpaired image and text data. MUSK was pretrained on 50 million pathology images from 11,577 patients and one billion pathology-related text tokens using unified masked modelling. It was further pretrained on one million pathology image-text pairs to efficiently align the vision and language features. With minimal or no further training, MUSK was tested in a wide range of applications and demonstrated superior performance across 23 patch-level and slide-level benchmarks, including image-to-text and text-to-image retrieval, visual question answering, image classification and molecular biomarker prediction. Furthermore, MUSK showed strong performance in outcome prediction, including melanoma relapse prediction, pan-cancer prognosis prediction and immunotherapy response prediction in lung and gastro-oesophageal cancers. MUSK effectively combined complementary information from pathology images and clinical reports and could potentially improve diagnosis and precision in cancer therapy.
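The abstract describes pretraining on unpaired images and text via unified masked modelling, in which both image patches and text tokens are corrupted and the model learns to reconstruct the masked positions. The toy sketch below illustrates only the masking step of that idea; it is a hypothetical illustration (the `mask_tokens` helper, `MASK_ID` value, and token ids are assumptions, not the authors' code).

```python
import random

# Toy illustration of the masking step in masked modelling: both image-patch
# token ids and text token ids can be treated as one discrete sequence, a
# fixed fraction is replaced by a [MASK] id, and the reconstruction loss is
# computed only on the masked positions.

MASK_ID = 0  # hypothetical id reserved for the mask token


def mask_tokens(tokens, mask_ratio=0.15, seed=None):
    """Replace ~mask_ratio of the token ids with MASK_ID.

    Returns the corrupted sequence and the indices that were masked,
    which are the only positions a masked-modelling loss would score.
    """
    rng = random.Random(seed)
    n_mask = max(1, int(len(tokens) * mask_ratio))
    masked_idx = sorted(rng.sample(range(len(tokens)), n_mask))
    corrupted = list(tokens)
    for i in masked_idx:
        corrupted[i] = MASK_ID
    return corrupted, masked_idx


# Example: a unified sequence could interleave text-token ids and
# image-patch-token ids; the masking step treats them identically.
sequence = [101, 7, 42, 9, 583, 12, 77, 3]
corrupted, masked_idx = mask_tokens(sequence, mask_ratio=0.25, seed=0)
print(corrupted, masked_idx)
```

In the full approach described above, a transformer would then predict the original ids at `masked_idx`, letting large unlabelled and unpaired corpora be used before the smaller paired image-text alignment stage.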