Bickmann Lucas, Sandmann Sarah, Walter Carolin, Varghese Julian
Institute of Medical Data Science, Otto-von-Guericke University Magdeburg, Magdeburg, Germany.
Institute of Medical Informatics, University of Münster, Münster, Germany.
BMC Bioinformatics. 2025 Jul 29;26(1):201. doi: 10.1186/s12859-025-06236-8.
Rapid extraction and visualization of cell-specific gene expression is important for automatic cell type annotation, e.g. in single-cell analysis. An emerging field uses tools such as curated databases or machine learning methods to support cell type annotation. However, complementary approaches that efficiently incorporate the latest knowledge from free-text articles in literature databases, such as PubMed, are understudied.
This work introduces the PubMed Gene/Cell type-Relation Atlas (PuMA), which provides a local, easy-to-use web interface to facilitate literature-driven cell type annotation. It uses a pretrained, machine-learning-based named entity recognition model to extract gene and cell type concepts from PubMed, links biomedical ontologies, and suggests gene-to-cell-type relations based on a ranking score. It includes a search tool for genes and cell types and provides an interactive graph visualization for exploring cross-relations. Each result is fully traceable through links to the relevant PubMed articles.
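The idea of ranking gene-to-cell-type relations while keeping every result traceable can be illustrated with a minimal sketch. The abstract does not define PuMA's ranking score, so the plain co-mention count used here is an assumption, as are the function name `rank_cell_types`, the placeholder article identifiers, and the toy entity annotations (which stand in for the output of a named entity recognition step).

```python
from collections import Counter, defaultdict

def rank_cell_types(articles, gene):
    """Rank cell types by how many articles mention them together with `gene`.

    `articles` is an iterable of (article_id, genes, cell_types) tuples.
    The co-mention count is a hypothetical stand-in for a ranking score;
    it is not the score used by PuMA.
    """
    counts = Counter()
    evidence = defaultdict(list)  # cell type -> supporting article ids
    for article_id, genes, cell_types in articles:
        if gene in genes:
            for cell_type in set(cell_types):
                counts[cell_type] += 1
                evidence[cell_type].append(article_id)
    # Sort by descending count, then alphabetically for stable ties.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(cell_type, n, evidence[cell_type]) for cell_type, n in ranked]

# Toy annotations with placeholder identifiers (not real PubMed records).
ARTICLES = [
    ("PMID-A", {"CD3E", "CD19"}, {"T cell", "B cell"}),
    ("PMID-B", {"CD3E"}, {"T cell"}),
    ("PMID-C", {"CD19"}, {"B cell"}),
]

for cell_type, score, sources in rank_cell_types(ARTICLES, "CD3E"):
    print(cell_type, score, sources)
```

Returning the supporting article identifiers alongside each score mirrors the traceability described above: every suggested relation can be followed back to the articles that produced it.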
This work enables researchers to analyse and automate cell type annotation based on PubMed articles. It complements manually curated marker gene databases and enables interactive visualizations. The evaluation shows that PuMA is competitive with an extensive manually curated database across three gold standard datasets and two species, mouse and human. The software framework is freely available and enables regular article imports for incremental knowledge updates. GitLab: https://imigitlab.uni-muenster.de/published/PuMA/.