Nadig Ajay, Thoutam Akshaya, Hughes Madeline, Gupta Anay, Navia Andrew W, Fusi Nicolo, Raghavan Srivatsan, Winter Peter S, Amini Ava P, Crawford Lorin
Harvard Medical School, Boston, MA, USA.
Massachusetts General Hospital, Boston, MA, USA.
bioRxiv. 2025 Feb 24:2025.02.19.639127. doi: 10.1101/2025.02.19.639127.
Foundation models for single-cell transcriptomics have the potential to augment (or replace) purpose-built tools for a variety of common analyses, especially when data are sparse. Recent work with large language models has shown that training data composition greatly shapes performance; however, to date, single-cell foundation models have ignored this aspect, opting instead to train on the largest possible corpus. We systematically investigate the consequences of training dataset composition on the behavior of deep learning models of single-cell transcriptomics, focusing on human hematopoiesis as a tractable model system and including cells from adult and developing tissues, disease states, and perturbation atlases. We find that (1) these models generalize poorly to unseen cell types, (2) adding malignant cells to a healthy cell training corpus does not necessarily improve modeling of unseen malignant cells, and (3) including an embryonic stem cell differentiation atlas during training improves performance on out-of-distribution tasks. Our results emphasize the importance of diverse training data and suggest strategies to optimize future single-cell foundation models.