Simple and effective embedding model for single-cell biology built from ChatGPT.

Authors

Chen Yiqun, Zou James

Affiliations

Department of Biomedical Data Science, Stanford University, Stanford, CA, USA.

Department of Electrical Engineering, Stanford University, Stanford, CA, USA.

Publication information

Nat Biomed Eng. 2025 Apr;9(4):483-493. doi: 10.1038/s41551-024-01284-6. Epub 2024 Dec 6.

Abstract

Large-scale gene-expression data are being leveraged to pretrain models that implicitly learn gene and cellular functions. However, such models require extensive data curation and training. Here we explore a much simpler alternative: leveraging ChatGPT embeddings of genes based on the literature. We used GPT-3.5 to generate gene embeddings from text descriptions of individual genes and to then generate single-cell embeddings by averaging the gene embeddings weighted by each gene's expression level. We also created a sentence embedding for each cell by using only the gene names ordered by their expression level. On many downstream tasks used to evaluate pretrained single-cell embedding models (particularly tasks of gene-property and cell-type classification), our model, which we named GenePT, achieved comparable or better performance than models pretrained from gene-expression profiles of millions of cells. GenePT shows that large-language-model embeddings of the literature provide a simple and effective path to encoding single-cell biological knowledge.
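
The two cell representations described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `cell_embedding` and `cell_sentence`, the toy 2-dimensional vectors standing in for GPT-3.5 text embeddings, the normalization of weights to sum to 1, and the exclusion of unexpressed genes are all assumptions made for illustration.

```python
from typing import Dict, List

# Toy stand-ins for literature-derived gene embeddings (hypothetical values;
# real embeddings would come from a text-embedding model and be much longer).
gene_embeddings: Dict[str, List[float]] = {
    "CD3E": [1.0, 0.0],
    "CD19": [0.0, 1.0],
    "ACTB": [1.0, 1.0],
}

# Expression levels of one cell (hypothetical values).
expression: Dict[str, float] = {"CD3E": 3.0, "ACTB": 1.0, "CD19": 0.0}


def cell_embedding(expr: Dict[str, float],
                   emb: Dict[str, List[float]]) -> List[float]:
    """Average the gene embeddings, weighted by each gene's expression."""
    # Assumption: only expressed genes that have an embedding contribute.
    genes = [g for g, x in expr.items() if x > 0 and g in emb]
    if not genes:
        raise ValueError("no expressed gene has an embedding")
    total = sum(expr[g] for g in genes)
    out = [0.0] * len(emb[genes[0]])
    for g in genes:
        weight = expr[g] / total  # weights sum to 1 (an assumption)
        for i, value in enumerate(emb[g]):
            out[i] += weight * value
    return out


def cell_sentence(expr: Dict[str, float]) -> str:
    """Join the names of expressed genes, highest expression first."""
    expressed = [g for g, x in expr.items() if x > 0]
    ranked = sorted(expressed, key=lambda g: expr[g], reverse=True)
    return " ".join(ranked)


weighted = cell_embedding(expression, gene_embeddings)
sentence = cell_sentence(expression)
```

In the sentence-based variant, the string returned by `cell_sentence` would then be passed to a text-embedding model to obtain the cell embedding; that call is omitted here because it depends on the embedding service used.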
