Department of Pathology, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA.
Department of Pathology, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA.
Nat Med. 2024 Mar;30(3):863-874. doi: 10.1038/s41591-024-02856-4. Epub 2024 Mar 19.
The accelerated adoption of digital pathology and advances in deep learning have enabled the development of robust models for various pathology tasks across a diverse array of diseases and patient cohorts. However, model training is often difficult due to label scarcity in the medical domain, and a model's usage is limited by the specific task and disease for which it is trained. Additionally, most models in histopathology leverage only image data, a stark contrast to how humans teach each other and reason about histopathologic entities. We introduce CONtrastive learning from Captions for Histopathology (CONCH), a visual-language foundation model developed using diverse sources of histopathology images, biomedical text and, notably, over 1.17 million image-caption pairs through task-agnostic pretraining. Evaluated on a suite of 14 diverse benchmarks, CONCH can be transferred to a wide range of downstream tasks involving histopathology images and/or text, achieving state-of-the-art performance on histology image classification, segmentation, captioning, and text-to-image and image-to-text retrieval. CONCH represents a substantial leap over concurrent visual-language pretrained systems for histopathology, with the potential to directly facilitate a wide array of machine learning-based workflows requiring minimal or no further supervised fine-tuning.
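The abstract names contrastive learning on image-caption pairs and reports image-to-text and text-to-image retrieval, but it does not spell out the training objective. As a rough illustration only, the following is a minimal sketch of a generic CLIP-style symmetric contrastive loss and a cosine-similarity retrieval step. It is not CONCH's actual objective or code. The function names, the temperature value of 0.07, and the toy two-dimensional embeddings are all assumptions made for this example.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def similarity_matrix(image_embs, text_embs, temperature=0.07):
    """Cosine similarity of every image with every text, divided by a temperature.

    The temperature of 0.07 is an assumed, commonly used default,
    not a value taken from the abstract.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    return [
        [sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
        for i in imgs
    ]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct column for row k is column k."""
    total = 0.0
    for k, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[k]
    return total / len(logits)


def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Average of the image-to-text and text-to-image cross-entropy terms.

    Pair k is assumed to be (image_embs[k], text_embs[k]).
    """
    logits = similarity_matrix(image_embs, text_embs, temperature)
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))


def retrieve(query_emb, candidate_embs):
    """Return the index of the candidate with the highest cosine similarity."""
    scores = similarity_matrix([query_emb], candidate_embs, temperature=1.0)[0]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy embeddings standing in for encoder outputs (hypothetical values).
image_embs = [[1.0, 0.0], [0.0, 1.0]]
text_embs = [[0.9, 0.1], [0.1, 0.9]]

aligned_loss = symmetric_contrastive_loss(image_embs, text_embs)
shuffled_loss = symmetric_contrastive_loss(image_embs, text_embs[::-1])
best_caption = retrieve(image_embs[0], text_embs)
```

In this toy setup, the loss is lower when each image is paired with its own caption than when the captions are swapped, and the same similarity scores can be reused to rank captions for an image (or images for a caption) without any task-specific fine-tuning.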