Liu Tianyu, Chen Tianqi, Zheng Wangjie, Luo Xiao, Chen Yiqun, Zhao Hongyu
bioRxiv. 2025 Aug 23:2023.12.07.569910. doi: 10.1101/2023.12.07.569910.
Various Foundation Models (FMs) built on the pre-training and fine-tuning framework have been applied to single-cell data analysis with varying degrees of success. In this manuscript, we propose scELMo (Single-cell Embedding from Language Models), a method for analyzing single-cell data that uses Large Language Models (LLMs) to generate both textual descriptions of metadata and embeddings of those descriptions. We combine the LLM-derived embeddings with the raw data under a zero-shot learning framework, and extend the method's functionality with a fine-tuning framework for additional tasks. We demonstrate that scELMo can perform cell clustering, batch effect correction, and cell-type annotation without training a new model. Moreover, its fine-tuning framework supports more challenging tasks, including in-silico treatment analysis and perturbation modeling. scELMo has a lighter architecture and lower resource requirements than large-scale FMs. In our evaluations, it also outperforms recent large-scale FMs (such as scGPT [1] and Geneformer [2]) and other LLM-based single-cell analysis pipelines (such as GenePT [3] and GPTCelltype [4]), suggesting a promising path for developing domain-specific FMs.
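To make the zero-shot combination of LLM embeddings and raw data concrete, here is a minimal sketch, assuming cell-level embeddings are formed as expression-weighted averages of per-gene embeddings derived from LLM-written gene descriptions. All names and the weighting scheme below are illustrative assumptions for exposition, not the actual scELMo interface.

```python
# Minimal sketch (assumed, not the scELMo API): combine a normalized
# expression matrix with embeddings of LLM-generated gene descriptions
# to obtain cell-level embeddings usable for clustering or annotation.
import numpy as np

def cell_embeddings(expr: np.ndarray, gene_emb: np.ndarray) -> np.ndarray:
    """Expression-weighted average of per-gene text embeddings.

    expr:     (n_cells, n_genes) normalized expression matrix.
    gene_emb: (n_genes, dim) embeddings of LLM-written gene descriptions,
              e.g. from a text-embedding model (an assumption here).
    Returns:  (n_cells, dim) cell-level embeddings.
    """
    # Normalize each cell's expression into weights; guard against
    # all-zero cells to avoid division by zero.
    weights = expr / np.clip(expr.sum(axis=1, keepdims=True), 1e-12, None)
    return weights @ gene_emb

# Toy usage: 5 cells, 3 genes, 4-dimensional description embeddings.
rng = np.random.default_rng(0)
expr = rng.random((5, 3))
gene_emb = rng.normal(size=(3, 4))
emb = cell_embeddings(expr, gene_emb)
print(emb.shape)  # (5, 4); feed into a clustering or annotation step
```

Because such embeddings require no model training, downstream steps like clustering can run directly on them, which is consistent with the zero-shot claims above; the paper's fine-tuning framework would instead learn on top of these representations.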