Simple and effective embedding model for single-cell biology built from ChatGPT.

Authors

Chen Yiqun, Zou James

Affiliations

Department of Biomedical Data Science, Stanford University, Stanford, CA, USA.

Department of Electrical Engineering, Stanford University, Stanford, CA, USA.

Publication information

Nat Biomed Eng. 2025 Apr;9(4):483-493. doi: 10.1038/s41551-024-01284-6. Epub 2024 Dec 6.

DOI:10.1038/s41551-024-01284-6

PMID:39643729

Abstract

Large-scale gene-expression data are being leveraged to pretrain models that implicitly learn gene and cellular functions. However, such models require extensive data curation and training. Here we explore a much simpler alternative: leveraging ChatGPT embeddings of genes based on the literature. We used GPT-3.5 to generate gene embeddings from text descriptions of individual genes and to then generate single-cell embeddings by averaging the gene embeddings weighted by each gene's expression level. We also created a sentence embedding for each cell by using only the gene names ordered by their expression level. On many downstream tasks used to evaluate pretrained single-cell embedding models (particularly gene-property and cell-type classification tasks), our model, which we named GenePT, achieved comparable or better performance than models pretrained from gene-expression profiles of millions of cells. GenePT shows that large-language-model embeddings of the literature provide a simple and effective path to encoding single-cell biological knowledge.
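
The two cell representations described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function names, the toy gene set, and the 3-dimensional vectors are hypothetical stand-ins for real GPT-3.5 text embeddings, and the handling of unexpressed or unembedded genes is an assumption. The step that sends the gene-name sentence to an embedding model is left out.

```python
def cell_embedding(expression, gene_embeddings):
    """Expression-weighted average of gene embeddings for one cell.

    expression: dict mapping gene name -> non-negative expression value.
    gene_embeddings: dict mapping gene name -> list of floats (same length).
    Assumption: genes with zero expression or no embedding are skipped.
    """
    used = [(g, x) for g, x in expression.items()
            if x > 0 and g in gene_embeddings]
    if not used:
        raise ValueError("no expressed gene has an embedding")
    total = sum(x for _, x in used)
    dim = len(gene_embeddings[used[0][0]])
    avg = [0.0] * dim
    for gene, x in used:
        weight = x / total  # weights sum to 1 across the genes used
        for i, v in enumerate(gene_embeddings[gene]):
            avg[i] += weight * v
    return avg


def cell_sentence(expression):
    """Gene names of one cell, ordered by decreasing expression.

    Assumption: unexpressed genes are dropped and ties are broken
    alphabetically; the result would then be passed to a text-embedding
    model, which is not shown here.
    """
    expressed = [(g, x) for g, x in expression.items() if x > 0]
    expressed.sort(key=lambda gx: (-gx[1], gx[0]))
    return " ".join(g for g, _ in expressed)


# Toy data (hypothetical values, not real embeddings or measurements).
toy_embeddings = {
    "CD3E": [1.0, 0.0, 0.0],
    "CD19": [0.0, 1.0, 0.0],
    "LYZ": [0.0, 0.0, 1.0],
}
toy_cell = {"CD3E": 3.0, "CD19": 1.0, "LYZ": 0.0, "UNKNOWN1": 5.0}

vec = cell_embedding(toy_cell, toy_embeddings)
sentence = cell_sentence(toy_cell)
print(vec)
print(sentence)
```

In this sketch the weighted average needs only a lookup table of precomputed gene embeddings, whereas the sentence variant requires one embedding-model call per cell.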
