Liu Tianyu, Chen Tianqi, Zheng Wangjie, Luo Xiao, Chen Yiqun, Zhao Hongyu
bioRxiv. 2025 Aug 23:2023.12.07.569910. doi: 10.1101/2023.12.07.569910.
Various Foundation Models (FMs) built on the pre-training and fine-tuning framework have been applied to single-cell data analysis with varying degrees of success. In this manuscript, we propose scELMo (Single-cell Embedding from Language Models), a method for analyzing single-cell data that uses Large Language Models (LLMs) to generate both textual descriptions of metadata and embeddings of those descriptions. We combine the LLM-derived embeddings with the raw data under a zero-shot learning framework, and extend the method's functionality with a fine-tuning framework for additional tasks. We demonstrate that scELMo can perform cell clustering, batch effect correction, and cell-type annotation without training a new model. Moreover, its fine-tuning framework supports more challenging tasks, including in-silico treatment analysis and perturbation modeling. scELMo has a lighter architecture and lower resource requirements than large-scale FMs. In our evaluations, it also outperforms recent large-scale FMs (such as scGPT [1] and Geneformer [2]) and other LLM-based single-cell analysis pipelines (such as GenePT [3] and GPTCelltype [4]), suggesting a promising path for developing domain-specific FMs.
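To make the zero-shot combination of LLM embeddings and raw data concrete, here is a minimal sketch, assuming cell-level embeddings are formed as expression-weighted averages of per-gene embeddings derived from LLM-written gene descriptions. All names and the weighting scheme below are illustrative assumptions for exposition, not the actual scELMo interface.

```python
# Minimal sketch (assumed, not the scELMo API): combine a normalized
# expression matrix with embeddings of LLM-generated gene descriptions
# to obtain cell-level embeddings usable for clustering or annotation.
import numpy as np

def cell_embeddings(expr: np.ndarray, gene_emb: np.ndarray) -> np.ndarray:
    """Expression-weighted average of per-gene text embeddings.

    expr:     (n_cells, n_genes) normalized expression matrix.
    gene_emb: (n_genes, dim) embeddings of LLM-written gene descriptions,
              e.g. from a text-embedding model (an assumption here).
    Returns:  (n_cells, dim) cell-level embeddings.
    """
    # Normalize each cell's expression into weights; guard against
    # all-zero cells to avoid division by zero.
    weights = expr / np.clip(expr.sum(axis=1, keepdims=True), 1e-12, None)
    return weights @ gene_emb

# Toy usage: 5 cells, 3 genes, 4-dimensional description embeddings.
rng = np.random.default_rng(0)
expr = rng.random((5, 3))
gene_emb = rng.normal(size=(3, 4))
emb = cell_embeddings(expr, gene_emb)
print(emb.shape)  # (5, 4); feed into a clustering or annotation step
```

Because such embeddings require no model training, downstream steps like clustering can run directly on them, which is consistent with the zero-shot claims above; the paper's fine-tuning framework would instead learn on top of these representations.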