Combining Real and Synthetic Data to Overcome Limited Training Datasets in Multimodal Learning.

Authors

Marini Niccolo, Liang Zhaohui, Rajaraman Sivaramakrishnan, Xue Zhiyun, Antani Sameer

Affiliation

Division of Intramural Research, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.

Publication information

medRxiv. 2025 Jul 17:2025.07.16.25331662. doi: 10.1101/2025.07.16.25331662.

DOI:10.1101/2025.07.16.25331662

PMID:40791679

Original link: https://pmc.ncbi.nlm.nih.gov/articles/PMC12338939/

Abstract

Biomedical data are inherently multimodal, capturing complementary aspects of a patient's condition. Deep learning (DL) algorithms that integrate multiple biomedical modalities can significantly improve clinical decision-making, especially in domains where data are difficult to collect and highly heterogeneous. However, developing effective and reliable multimodal DL methods remains challenging because it requires large training datasets with paired samples from the modalities of interest. An increasing number of de-identified biomedical datasets are publicly accessible, though they still tend to be unimodal. For example, several publicly available skin lesion datasets aid automated dermatology clinical decision-making. Still, they lack annotated reports paired with the images, thereby limiting the advancement and use of multimodal DL algorithms. This work presents a strategy exploiting real and synthesized data in a multimodal architecture that encodes fine-grained text representations within image embeddings to create a robust representation of skin lesion data. Large language models (LLMs) are used to synthesize textual descriptions from image metadata; these descriptions are then paired with the original skin lesion images and used for model development. The architecture is evaluated on the classification of skin lesion images, considering nine internal and external data sources. The proposed multimodal representation outperforms the unimodal one on the classification of skin lesion images, achieving superior performance in every tested dataset.
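
The data-pairing step described in the abstract (metadata turned into a text description, which is then paired with the original image) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the field names (`image_path`, `label`, `age`, `site`), the prompt wording, the choice to exclude the label from the metadata, and the `template_generator` function are all hypothetical, and the generator is a deterministic stand-in for the LLM call so that the example runs without any external service.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PairedSample:
    """One training example: a real image paired with a synthetic description."""
    image_path: str
    description: str
    label: str


def build_prompt(metadata: dict) -> str:
    """Turn image metadata into an instruction for a text generator."""
    # Skip empty fields; sort keys so the prompt is deterministic.
    fields = ", ".join(
        f"{key}: {value}"
        for key, value in sorted(metadata.items())
        if value not in (None, "")
    )
    return (
        "Write a short clinical description of a skin lesion "
        "with these attributes: " + fields
    )


def template_generator(prompt: str) -> str:
    """Hypothetical stand-in for an LLM: echoes the attributes in a sentence."""
    attributes = prompt.split(": ", 1)[1]
    return f"Skin lesion image. Recorded attributes: {attributes}."


def pair_samples(
    records: list[dict],
    generate: Callable[[str], str] = template_generator,
) -> list[PairedSample]:
    """Pair each real image with a description synthesized from its metadata."""
    samples = []
    for record in records:
        # Assumption: everything except the path and the label counts as metadata.
        metadata = {
            key: value
            for key, value in record.items()
            if key not in ("image_path", "label")
        }
        description = generate(build_prompt(metadata))
        samples.append(PairedSample(record["image_path"], description, record["label"]))
    return samples


if __name__ == "__main__":
    records = [
        {"image_path": "img_001.jpg", "label": "nevus", "age": 55, "site": "back"},
    ]
    for sample in pair_samples(records):
        print(sample)
```

In practice, `generate` would be replaced by a call to whichever LLM is used, and the resulting image and text pairs would feed the multimodal model; the sketch covers only the pairing logic.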
