Levine Daniel, Rizvi Syed Asad, Lévy Sacha, Pallikkavaliyaveetil Nazreen, Zhang David, Chen Xingyu, Ghadermarzi Sina, Wu Ruiming, Zheng Zihe, Vrkic Ivan, Zhong Anna, Raskin Daphne, Han Insu, de Oliveira Fonseca Antonio Henrique, Caro Josue Ortega, Karbasi Amin, Dhodapkar Rahul M, van Dijk David
Department of Computer Science, Yale University, New Haven, CT, USA.
School of Engineering and Applied Science, University of Pennsylvania, Philadelphia, PA, USA.
bioRxiv. 2024 Oct 29:2023.09.11.557287. doi: 10.1101/2023.09.11.557287.
We introduce Cell2Sentence (C2S), a novel method to directly adapt large language models to a biological context, specifically single-cell transcriptomics. By transforming gene expression data into "cell sentences," C2S bridges the gap between natural language processing and biology. We demonstrate that cell sentences enable the fine-tuning of language models for diverse tasks in biology, including cell generation, complex cell type annotation, and direct data-driven text generation. Our experiments reveal that GPT-2, when fine-tuned with C2S, can generate biologically valid cells from cell type inputs and accurately predict cell types from cell sentences. This illustrates that language models, through C2S fine-tuning, can acquire a significant understanding of single-cell biology while maintaining robust text generation capabilities. C2S offers a flexible, accessible framework to integrate natural language processing with transcriptomics, utilizing existing models and libraries for a wide range of biological applications.
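The abstract does not spell out how a cell sentence is constructed. A minimal sketch, assuming a rank-order encoding in which the names of expressed genes are written out from highest to lowest expression, might look like the following. The function name `cell_to_sentence`, the `top_k` parameter, and the toy gene counts are hypothetical and chosen only for illustration; they are not taken from the C2S codebase.

```python
from typing import Optional, Sequence


def cell_to_sentence(
    expression: Sequence[float],
    gene_names: Sequence[str],
    top_k: Optional[int] = None,
) -> str:
    """Turn one cell's expression vector into a space-separated gene-name string.

    Assumption (not stated in the abstract): genes are ordered by decreasing
    expression, and genes with zero expression are dropped.
    """
    if len(expression) != len(gene_names):
        raise ValueError("expression and gene_names must have the same length")

    # Keep only genes that are actually expressed in this cell.
    expressed = [(name, value) for name, value in zip(gene_names, expression) if value > 0]

    # Sort by descending expression; Python's sort is stable, so ties keep input order.
    ranked = sorted(expressed, key=lambda pair: -pair[1])

    # Optionally truncate to the top_k most highly expressed genes.
    if top_k is not None:
        ranked = ranked[:top_k]

    return " ".join(name for name, _ in ranked)


# Toy example with made-up counts for four genes in a single cell.
genes = ["CD3E", "MS4A1", "GNLY", "LYZ"]
counts = [5, 0, 12, 3]
sentence = cell_to_sentence(counts, genes)
print(sentence)
```

A string of this form can be passed to any standard text tokenizer, which is what would allow an existing language model such as GPT-2 to be fine-tuned on cells without architectural changes.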