Cell2Sentence: Teaching Large Language Models the Language of Biology.

Authors

Levine Daniel, Rizvi Syed Asad, Lévy Sacha, Pallikkavaliyaveetil Nazreen, Zhang David, Chen Xingyu, Ghadermarzi Sina, Wu Ruiming, Zheng Zihe, Vrkic Ivan, Zhong Anna, Raskin Daphne, Han Insu, de Oliveira Fonseca Antonio Henrique, Caro Josue Ortega, Karbasi Amin, Dhodapkar Rahul M, van Dijk David

Affiliations

Department of Computer Science, Yale University, New Haven, CT, USA.

School of Engineering and Applied Science, University of Pennsylvania, Philadelphia, PA, USA.

Publication information

bioRxiv. 2024 Oct 29:2023.09.11.557287. doi: 10.1101/2023.09.11.557287.

DOI: 10.1101/2023.09.11.557287
PMID: 39554079
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11565894/
Abstract

We introduce Cell2Sentence (C2S), a novel method to directly adapt large language models to a biological context, specifically single-cell transcriptomics. By transforming gene expression data into "cell sentences," C2S bridges the gap between natural language processing and biology. We demonstrate that cell sentences enable the fine-tuning of language models for diverse tasks in biology, including cell generation, complex cell type annotation, and direct data-driven text generation. Our experiments reveal that GPT-2, when fine-tuned with C2S, can generate biologically valid cells based on cell type inputs, and accurately predict cell types from cell sentences. This illustrates that language models, through C2S fine-tuning, can acquire a significant understanding of single-cell biology while maintaining robust text generation capabilities. C2S offers a flexible, accessible framework to integrate natural language processing with transcriptomics, utilizing existing models and libraries for a wide range of biological applications.
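
The abstract does not spell out how expression data becomes a "cell sentence." In the C2S paper, a cell sentence is the sequence of gene names ordered by decreasing expression. The following is a minimal sketch of that idea, not the authors' implementation: the function name `cell_to_sentence`, the `top_k` cutoff, the choice to drop unexpressed genes, and the toy counts are all assumptions made for illustration.

```python
def cell_to_sentence(expression, gene_names, top_k=None):
    """Turn one cell's expression vector into a space-separated gene-name string.

    Hypothetical helper for illustration: genes are ranked by decreasing
    expression, genes with zero expression are dropped, and the result is
    optionally truncated to the first `top_k` genes.
    """
    if len(expression) != len(gene_names):
        raise ValueError("expression and gene_names must have the same length")

    # Keep only expressed genes, paired with their values.
    expressed = [
        (value, name)
        for value, name in zip(expression, gene_names)
        if value > 0
    ]

    # sorted() is stable, so genes with equal expression keep their input order.
    ranked = sorted(expressed, key=lambda pair: pair[0], reverse=True)

    if top_k is not None:
        ranked = ranked[:top_k]

    return " ".join(name for _, name in ranked)


# Toy example with made-up counts for four marker genes.
genes = ["CD3E", "MS4A1", "LYZ", "NKG7"]
counts = [5, 0, 12, 3]
print(cell_to_sentence(counts, genes))           # LYZ CD3E NKG7
print(cell_to_sentence(counts, genes, top_k=2))  # LYZ CD3E
```

A string of this form can be passed to a standard text tokenizer, which is what lets an existing language model such as GPT-2 be fine-tuned on cells without architectural changes.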

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b9af/11565894/1a0f5de14ff4/nihpp-2023.09.11.557287v4-f0005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b9af/11565894/51a2465bfec8/nihpp-2023.09.11.557287v4-f0006.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b9af/11565894/856ad4ecf93f/nihpp-2023.09.11.557287v4-f0007.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b9af/11565894/a60e1a50582c/nihpp-2023.09.11.557287v4-f0008.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b9af/11565894/2bef176a4715/nihpp-2023.09.11.557287v4-f0009.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b9af/11565894/dd094905f40f/nihpp-2023.09.11.557287v4-f0010.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b9af/11565894/4212434cd7ba/nihpp-2023.09.11.557287v4-f0011.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b9af/11565894/bd12f12ab823/nihpp-2023.09.11.557287v4-f0012.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b9af/11565894/2dcd6d4cdf87/nihpp-2023.09.11.557287v4-f0001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b9af/11565894/ffea7a58223f/nihpp-2023.09.11.557287v4-f0002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b9af/11565894/c972dcfd9956/nihpp-2023.09.11.557287v4-f0003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b9af/11565894/0a0b28ac9a5b/nihpp-2023.09.11.557287v4-f0004.jpg
