Marketing and Management Department, CapitalBio Technology, Beijing, 100176, China.
National Engineering Research Center for Beijing Biochip Technology, Beijing, 102206, China.
Sci Rep. 2024 Sep 11;14(1):21183. doi: 10.1038/s41598-024-72204-6.
Single-cell RNA sequencing (scRNA-seq) has emerged as a pivotal tool for exploring cellular landscapes across diverse species and tissues. Precise annotation of cell types is essential for understanding these landscapes, and it relies heavily on empirical knowledge and curated cell marker databases. In this study, we introduce MarkerGeneBERT, a natural language processing (NLP) system designed to extract information on species, tissues, cell types, and cell marker genes from the single-cell sequencing literature. Using MarkerGeneBERT, we systematically parsed the full text of 3702 single-cell sequencing-related studies, yielding a collection of 7901 cell markers representing 1606 cell types across 425 human tissues/subtissues, and 8223 cell markers representing 1674 cell types across 482 mouse tissues/subtissues. Comparative analysis against manually curated databases showed that our approach achieved 76% completeness and 75% accuracy, while also identifying 89 cell types and 183 marker genes absent from existing databases. Furthermore, we applied the brain tissue marker gene list compiled by MarkerGeneBERT to annotate scRNA-seq data, yielding results consistent with the original studies. Our findings underscore the efficacy of NLP-based methods in expediting and augmenting the annotation and interpretation of scRNA-seq data, and provide a systematic demonstration of the potential of this approach. The 27323 manually reviewed sentences used for training MarkerGeneBERT and the source code are hosted at https://github.com/chengpeng1116/MarkerGeneBERT.
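To make the comparison against curated databases concrete, the following is a minimal sketch of how completeness and accuracy could be computed from sets of (cell type, marker gene) pairs. The abstract does not define these two metrics, so the definitions here are assumptions: completeness is taken as the fraction of curated pairs that the extraction recovers, and accuracy as the fraction of extracted pairs that appear in the curated set. The function names and the toy data are hypothetical and are not taken from the MarkerGeneBERT repository.

```python
from typing import Iterable, Set, Tuple

Pair = Tuple[str, str]  # (cell type, marker gene)


def normalize(pairs: Iterable[Pair]) -> Set[Pair]:
    """Lower-case cell types and upper-case gene symbols so that
    trivially different spellings compare equal.
    Simplification: this also merges human and mouse symbol casing."""
    return {(cell.strip().lower(), gene.strip().upper()) for cell, gene in pairs}


def completeness_and_accuracy(extracted: Iterable[Pair], curated: Iterable[Pair]):
    """Assumed definitions (not stated in the abstract):
    completeness = |extracted AND curated| / |curated|
    accuracy     = |extracted AND curated| / |extracted|
    Also returns the extracted pairs missing from the curated set."""
    ext = normalize(extracted)
    cur = normalize(curated)
    shared = ext & cur
    completeness = len(shared) / len(cur) if cur else 0.0
    accuracy = len(shared) / len(ext) if ext else 0.0
    novel = ext - cur
    return completeness, accuracy, novel


# Hypothetical toy data, for illustration only.
curated = {
    ("Astrocyte", "GFAP"),
    ("Astrocyte", "AQP4"),
    ("Microglia", "CX3CR1"),
    ("Oligodendrocyte", "MBP"),
}
extracted = {
    ("astrocyte", "Gfap"),
    ("microglia", "CX3CR1"),
    ("oligodendrocyte", "MBP"),
    ("microglia", "P2RY12"),
}

completeness, accuracy, novel = completeness_and_accuracy(extracted, curated)
print(f"completeness={completeness:.2f} accuracy={accuracy:.2f} novel={sorted(novel)}")
```

Pairs that fall in `novel` are not necessarily errors: under these assumed definitions they lower accuracy, yet they may be genuine markers that the curated database has not recorded, which is the kind of finding the abstract reports as cell types and marker genes absent from existing databases.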