
Consequences of training data composition for deep learning models in single-cell biology.

Authors

Nadig Ajay, Thoutam Akshaya, Hughes Madeline, Gupta Anay, Navia Andrew W, Fusi Nicolo, Raghavan Srivatsan, Winter Peter S, Amini Ava P, Crawford Lorin

Affiliations

Harvard Medical School, Boston, MA, USA.

Massachusetts General Hospital, Boston, MA, USA.

Publication information

bioRxiv. 2025 Feb 24:2025.02.19.639127. doi: 10.1101/2025.02.19.639127.

DOI: 10.1101/2025.02.19.639127
PMID: 40060416
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11888162/

Abstract

Foundation models for single-cell transcriptomics have the potential to augment (or replace) purpose-built tools for a variety of common analyses, especially when data are sparse. Recent work with large language models has shown that training data composition greatly shapes performance; however, to date, single-cell foundation models have ignored this aspect, opting instead to train on the largest possible corpus. We systematically investigate the consequences of training dataset composition on the behavior of deep learning models of single-cell transcriptomics, focusing on human hematopoiesis as a tractable model system and including cells from adult and developing tissues, disease states, and perturbation atlases. We find that (1) these models generalize poorly to unseen cell types, (2) adding malignant cells to a healthy cell training corpus does not necessarily improve modeling of unseen malignant cells, and (3) including an embryonic stem cell differentiation atlas during training improves performance on out-of-distribution tasks. Our results emphasize the importance of diverse training data and suggest strategies to optimize future single-cell foundation models.

Figures

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5925/11888162/e00d077f8b37/nihpp-2025.02.19.639127v1-f0001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5925/11888162/62fe62227237/nihpp-2025.02.19.639127v1-f0002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5925/11888162/ea4ae6af0eb6/nihpp-2025.02.19.639127v1-f0003.jpg
