Jiang Douglas, Dai Zilin, Zhang Luxuan, Yu Qiyi, Sun Haoqi, Tian Feng
ArXiv. 2025 May 12:arXiv:2505.07896v1.
Understanding cell identity and function from single-cell sequencing data remains a key challenge in computational biology. We present a novel framework that leverages gene-specific textual annotations from the NCBI Gene database to generate biologically contextualized cell embeddings. For each cell in a single-cell RNA sequencing (scRNA-seq) dataset, we rank genes by expression level, retrieve their corresponding NCBI gene descriptions, and transform these descriptions into vector embeddings using large language models (LLMs). The models used include OpenAI's text-embedding-ada-002, text-embedding-3-small, and text-embedding-3-large (Jan 2024), as well as the domain-specific models BioBERT and SciBERT. Each cell's embedding is computed as an expression-weighted average of the description embeddings of the top-N most highly expressed genes in that cell, providing a compact, semantically rich representation. This multimodal strategy bridges structured biological data with state-of-the-art language modeling, enabling more interpretable downstream applications such as cell type clustering, cell vulnerability dissection, and trajectory inference.
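The expression-weighted averaging step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `cell_embedding`, the parameter `top_n`, and the choice to normalize by the sum of the top-N expression values are all assumptions, and the per-gene embeddings are assumed to have been computed beforehand (for example, by passing each NCBI gene description through one of the embedding models named above).

```python
import numpy as np

def cell_embedding(expression, gene_embeddings, top_n=3):
    """Expression-weighted average of the embeddings of the top-N expressed genes.

    expression      : 1-D array, one expression value per gene for a single cell
    gene_embeddings : 2-D array (n_genes, dim), one precomputed text embedding per gene
    top_n           : hypothetical parameter, number of top-expressed genes to keep
    """
    expression = np.asarray(expression, dtype=float)
    gene_embeddings = np.asarray(gene_embeddings, dtype=float)
    top_n = min(top_n, expression.shape[0])

    # Rank genes by expression (descending) and keep the top-N indices.
    top_idx = np.argsort(expression)[::-1][:top_n]
    weights = expression[top_idx]

    # Assumption: weights are normalized to sum to 1. A cell with no
    # expressed genes among the top-N gets a zero vector.
    total = weights.sum()
    if total <= 0:
        return np.zeros(gene_embeddings.shape[1])
    return (weights[:, None] * gene_embeddings[top_idx]).sum(axis=0) / total

# Toy example: 4 genes with 2-dimensional embeddings (values are made up).
expression = [5.0, 0.0, 3.0, 2.0]
gene_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 2.0]]
print(cell_embedding(expression, gene_embeddings, top_n=2))
```

With `top_n=2`, only the first and third genes contribute, weighted 5:3, so the result is (5·[1, 0] + 3·[1, 1]) / 8 = [1.0, 0.375]. Stacking the resulting vectors for all cells would give a matrix that could then be passed to a clustering or trajectory-inference method.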