A natural language processing system for the efficient extraction of cell markers.

Affiliations

Marketing and Management Department, CapitalBio Technology, Beijing, 100176, China.

National Engineering Research Center for Beijing Biochip Technology, Beijing, 102206, China.

Publication information

Sci Rep. 2024 Sep 11;14(1):21183. doi: 10.1038/s41598-024-72204-6.

DOI: 10.1038/s41598-024-72204-6
PMID: 39261578
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11390993/

Abstract

Single-cell RNA sequencing (scRNA-seq) has emerged as a pivotal tool for exploring cellular landscapes across diverse species and tissues. Precise annotation of cell types is essential for understanding these landscapes and relies heavily on empirical knowledge and curated cell marker databases. In this study, we introduce MarkerGeneBERT, a natural language processing (NLP) system designed to extract critical information from the literature regarding species, tissues, cell types, and cell marker genes in the context of single-cell sequencing studies. Using MarkerGeneBERT, we systematically parsed full-text articles from 3702 single-cell sequencing-related studies, yielding a collection of 7901 cell markers representing 1606 cell types across 425 human tissues/subtissues, and 8223 cell markers representing 1674 cell types across 482 mouse tissues/subtissues. Comparative analysis against manually curated databases demonstrated that our approach achieved 76% completeness and 75% accuracy, while also revealing 89 cell types and 183 marker genes absent from existing databases. Furthermore, we successfully applied the brain tissue marker gene list compiled by MarkerGeneBERT to annotate scRNA-seq data, yielding results consistent with the original studies. Our findings underscore the efficacy of NLP-based methods in expediting and augmenting the annotation and interpretation of scRNA-seq data, and provide a systematic demonstration of the potential of this approach. The 27323 manually reviewed sentences used to train MarkerGeneBERT and the source code are hosted at https://github.com/chengpeng1116/MarkerGeneBERT.
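
The comparison against curated databases can be illustrated with a small sketch. The abstract does not define "completeness" and "accuracy" precisely, so the code below assumes one plausible reading: completeness is the share of curated (cell type, marker gene) pairs that the extraction recovers, and accuracy is the share of extracted pairs that also appear in the curated set. The function names and the toy brain-cell data are hypothetical and are not taken from the MarkerGeneBERT repository.

```python
from typing import Dict, Set, Tuple

Pair = Tuple[str, str]  # (cell type, marker gene)


def normalize(pair: Pair) -> Pair:
    """Normalize case and whitespace so trivially different spellings match.

    Upper-casing gene symbols is a simplification: it ignores the
    human/mouse capitalization convention (e.g. CX3CR1 vs. Cx3cr1).
    """
    cell_type, gene = pair
    return cell_type.strip().lower(), gene.strip().upper()


def compare_to_curated(extracted: Set[Pair], curated: Set[Pair]) -> Dict[str, object]:
    """Compare extracted pairs with a curated reference set (assumed definitions)."""
    ext = {normalize(p) for p in extracted}
    cur = {normalize(p) for p in curated}
    shared = ext & cur
    return {
        # Assumed: fraction of curated pairs that were recovered.
        "completeness": len(shared) / len(cur) if cur else 0.0,
        # Assumed: fraction of extracted pairs confirmed by the reference.
        "accuracy": len(shared) / len(ext) if ext else 0.0,
        # Pairs absent from the reference: candidates for manual review,
        # since they may be new findings rather than extraction errors.
        "absent_from_reference": sorted(ext - cur),
    }


# Hypothetical toy data, for illustration only.
extracted = {
    ("Microglia", "CX3CR1"),
    ("microglia", "P2RY12"),
    ("Astrocyte", "GFAP"),
    ("Astrocyte", "SOX9"),
}
curated = {
    ("microglia", "Cx3cr1"),
    ("microglia", "TMEM119"),
    ("astrocyte", "GFAP"),
    ("astrocyte ", "AQP4"),
}

result = compare_to_curated(extracted, curated)
print(result)
```

On this toy input, two of the four curated pairs are recovered and two of the four extracted pairs are confirmed, so both ratios are 0.5. The two unmatched extracted pairs are listed separately, mirroring how the study reports cell types and marker genes that existing databases lack.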
