Bridging Large Language Models and Single-Cell Transcriptomics in Dissecting Selective Motor Neuron Vulnerability.

Authors

Jiang Douglas, Dai Zilin, Zhang Luxuan, Yu Qiyi, Sun Haoqi, Tian Feng

Publication information

ArXiv. 2025 May 12:arXiv:2505.07896v1.

PMID:40463696
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12133077/
Abstract

Understanding cell identity and function through single-cell level sequencing data remains a key challenge in computational biology. We present a novel framework that leverages gene-specific textual annotations from the NCBI Gene database to generate biologically contextualized cell embeddings. For each cell in a single-cell RNA sequencing (scRNA-seq) dataset, we rank genes by expression level, retrieve their corresponding NCBI gene descriptions, and transform these descriptions into vector embedding representations using large language models (LLMs). The models used include OpenAI's text-embedding-ada-002, text-embedding-3-small, and text-embedding-3-large (Jan 2024), as well as domain-specific models BioBERT and SciBERT. Embeddings are computed via an expression-weighted average across the top-N most highly expressed genes in each cell, providing a compact, semantically rich representation. This multimodal strategy bridges structured biological data with state-of-the-art language modeling, enabling more interpretable downstream applications such as cell type clustering, cell vulnerability dissection, and trajectory inference.
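
The expression-weighted averaging step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `cell_embedding`, the two-dimensional toy vectors, and the expression values are all assumptions made for the example. In a real pipeline the per-gene vectors would come from running an embedding model over NCBI Gene descriptions, a step that is omitted here. The sketch also assumes that genes without a description vector are skipped before the top-N cut, and that weights are normalized to sum to 1; the abstract does not specify either detail.

```python
from typing import Dict, List, Sequence


def cell_embedding(
    expression: Dict[str, float],
    gene_vectors: Dict[str, Sequence[float]],
    top_n: int,
) -> List[float]:
    """Expression-weighted average of gene-description vectors for one cell."""
    # Rank expressed genes from highest to lowest expression, keeping only
    # genes that have a description vector (assumption: others are skipped).
    ranked = sorted(
        (g for g in expression if g in gene_vectors and expression[g] > 0),
        key=lambda g: expression[g],
        reverse=True,
    )[:top_n]
    if not ranked:
        raise ValueError("no expressed gene has a description embedding")

    # Normalize the expression values of the selected genes into weights.
    total = sum(expression[g] for g in ranked)
    dim = len(gene_vectors[ranked[0]])
    out = [0.0] * dim
    for g in ranked:
        weight = expression[g] / total
        for i, x in enumerate(gene_vectors[g]):
            out[i] += weight * x
    return out


# Toy data: made-up 2-d "description embeddings" and expression values.
gene_vectors = {"CHAT": [1.0, 0.0], "MNX1": [0.0, 1.0], "GAPDH": [1.0, 1.0]}
cell = {"CHAT": 3.0, "MNX1": 1.0, "GAPDH": 0.5}

emb = cell_embedding(cell, gene_vectors, top_n=2)
print(emb)  # [0.75, 0.25]: CHAT gets weight 3/4, MNX1 gets weight 1/4
```

Stacking one such vector per cell gives a cell-by-dimension matrix that could then be passed to a standard clustering or trajectory-inference tool.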

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/aea8/12133077/595d7efd61f8/nihpp-2505.07896v1-f0001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/aea8/12133077/c9bbd5628d9b/nihpp-2505.07896v1-f0002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/aea8/12133077/768bf3e6b960/nihpp-2505.07896v1-f0003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/aea8/12133077/b1ba8e2cbf13/nihpp-2505.07896v1-f0004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/aea8/12133077/a3a82feffaaa/nihpp-2505.07896v1-f0005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/aea8/12133077/9b362b59aa5c/nihpp-2505.07896v1-f0006.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/aea8/12133077/18caf1dbed73/nihpp-2505.07896v1-f0007.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/aea8/12133077/8a30b19f5880/nihpp-2505.07896v1-f0008.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/aea8/12133077/6e40a784c231/nihpp-2505.07896v1-f0009.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/aea8/12133077/030268d2c15c/nihpp-2505.07896v1-f0010.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/aea8/12133077/9e3dde7c38a0/nihpp-2505.07896v1-f0011.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/aea8/12133077/7c17167fd0c3/nihpp-2505.07896v1-f0012.jpg
