Liu Angela, Peng Beverly, Pankajam Ajith V, Duong Thu Elizabeth, Pryhuber Gloria, Scheuermann Richard H, Zhang Yun
Department of Informatics, J. Craig Venter Institute, La Jolla, CA, USA.
Division of Intramural Research, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.
BMC Methods. 2024;1. doi: 10.1186/s44330-024-00015-2. Epub 2024 Nov 4.
Single cell/nucleus RNA sequencing (scRNA-seq) technologies, which quantitatively describe cell transcriptional phenotypes, are revolutionizing our understanding of cell biology, leading to new insights in cell type identification, disease mechanisms, and drug development. The tremendous growth in scRNA-seq data has posed new challenges in efficiently characterizing data-driven cell types and identifying quantifiable marker genes for cell type classification. Machine learning and explainable artificial intelligence have emerged as an effective approach to studying large-scale scRNA-seq data.
NS-Forest is a random forest-based machine learning algorithm that aims to provide a scalable, data-driven solution for identifying minimum combinations of necessary and sufficient marker genes that capture cell type identity with maximum classification accuracy. Here, we describe the latest version, NS-Forest version 4.0, and its companion Python package (https://github.com/JCVenterInstitute/NSForest). Its enhancements select marker gene combinations that exhibit highly selective expression patterns among closely related cell types, and perform marker gene selection more efficiently for large-scale scRNA-seq data atlases with millions of cells.
By modularizing the final decision tree step, NS-Forest v4.0 can be used to compare the performance of user-defined marker genes with the NS-Forest computationally derived marker genes based on the decision tree classifiers. To quantify how well the identified markers exhibit the desired pattern of being exclusively expressed at high levels within their target cell types, we introduce the On-Target Fraction metric. It ranges from 0 to 1, with a value of 1 assigned to markers that are expressed only within their target cell types and not in cells of any other cell type. NS-Forest v4.0 outperforms previous versions in simulation studies and in its ability to identify markers with higher On-Target Fraction values for closely related cell types in real data. It also outperforms other marker gene selection approaches for cell type classification, with significantly higher F-beta scores when applied to datasets from three human organs: brain, kidney, and lung.
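The two metrics named above can be illustrated with a minimal Python sketch. It is not taken from the NS-Forest package: the function names, the toy data, and the specific formula used for the On-Target Fraction (median expression in the target cell type divided by the sum of median expression across all cell types) are assumptions chosen only to satisfy the properties stated in the abstract, namely a range of 0 to 1 with a value of 1 when a marker is expressed only in its target cell type. The F-beta function uses the standard definition; the default of beta = 0.5, which weights precision more heavily than recall, is likewise an assumption.

```python
from statistics import median


def on_target_fraction(expression_by_type, target):
    """Hypothetical On-Target Fraction for one marker gene.

    expression_by_type maps a cell type name to a list of non-negative
    expression values for the gene in cells of that type.
    Assumed formula: median in the target type / sum of medians over all types.
    """
    medians = {ct: median(vals) for ct, vals in expression_by_type.items()}
    total = sum(medians.values())
    if total == 0:
        return 0.0  # gene is not expressed (by median) in any cell type
    return medians[target] / total


def f_beta(tp, fp, fn, beta=0.5):
    """Standard F-beta score from true positives, false positives, false negatives."""
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


# Toy data (illustrative only): one gene exclusive to type "A",
# one gene shared equally between types "A" and "B".
exclusive_gene = {"A": [5, 6, 7], "B": [0, 0, 1], "C": [0, 0, 0]}
shared_gene = {"A": [4, 4, 4], "B": [4, 4, 4], "C": [0, 0, 0]}

print(on_target_fraction(exclusive_gene, "A"))  # exclusive marker
print(on_target_fraction(shared_gene, "A"))     # marker shared with another type
print(f_beta(tp=9, fp=1, fn=6))                 # high precision, lower recall
```

In this sketch the exclusive gene scores at the top of the range and the shared gene scores lower, which is the behavior the abstract describes for markers of closely related cell types.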
Finally, we discuss potential use cases of the NS-Forest marker genes, including for designing spatial transcriptomics gene panels and semantic representation of cell types in biomedical ontologies, for the broad user community.