Ji Xiangling, Tsao Danielle, Bai Kailun, Tsao Min, Xing Li, Zhang Xuekui
Department of Mathematics and Statistics, University of Victoria, Victoria V8P 5C2, Canada.
Department of Mathematics and Statistics, University of Saskatchewan, Saskatoon S7N 5C9, Canada.
Bioinform Adv. 2023 Mar 13;3(1):vbad030. doi: 10.1093/bioadv/vbad030. eCollection 2023.
Single-cell RNA-sequencing (scRNA-seq) technology enables researchers to investigate a genome at the cellular level with unprecedented resolution. An organism consists of a heterogeneous collection of cell types, each of which plays a distinct role in various biological processes. Hence, the first step of scRNA-seq data analysis is often to distinguish cell types so they can be investigated separately. Researchers have recently developed several automated cell-type annotation tools that require neither biological knowledge nor subjective human decisions. Dropout is a crucial characteristic of scRNA-seq data and is widely used in differential expression analysis. However, no current cell annotation method explicitly utilizes dropout information. The aim of fully utilizing dropout information motivated this work.
We present scAnnotate, a cell annotation tool that fully utilizes dropout information. We model every gene's marginal distribution using a mixture model, which describes both the dropout proportion and the distribution of the non-dropout expression levels. Then, using an ensemble machine learning approach, we combine the mixture models of all genes into a single model for cell-type annotation. This combining approach avoids estimating the numerous parameters of the high-dimensional joint distribution of all genes. Using 14 real scRNA-seq datasets, we demonstrate that scAnnotate is competitive against nine existing annotation methods. Furthermore, because of its distinct modelling strategy, the cells misclassified by scAnnotate differ greatly from those misclassified by competitor methods. This suggests that using scAnnotate together with other methods could further improve annotation accuracy.
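The per-gene mixture idea can be illustrated with a minimal Python sketch. It is not the scAnnotate implementation or its R API. It assumes that a zero expression value marks a dropout, that non-dropout (log-scale) expression is normally distributed, and that the per-gene models are combined by simply summing their log-likelihoods (a naive-Bayes-style stand-in for the ensemble step, which the abstract does not specify). The function names `fit_gene_mixtures`, `log_likelihood`, and `annotate` are hypothetical.

```python
import math
from collections import defaultdict


def fit_gene_mixtures(X, labels, eps=1e-3):
    """For each cell type and gene, fit a two-part marginal model:
    a dropout proportion plus a normal distribution over non-zero values.
    Assumption: a zero expression value is treated as a dropout."""
    by_type = defaultdict(list)
    for row, lab in zip(X, labels):
        by_type[lab].append(row)

    model = {}
    for lab, rows in by_type.items():
        params = []
        for g in range(len(rows[0])):
            vals = [r[g] for r in rows]
            nonzero = [v for v in vals if v > 0]
            # Smoothed dropout proportion, kept strictly inside (0, 1).
            pi = (len(vals) - len(nonzero) + 1) / (len(vals) + 2)
            if nonzero:
                mu = sum(nonzero) / len(nonzero)
                var = sum((v - mu) ** 2 for v in nonzero) / len(nonzero)
            else:
                mu, var = 0.0, 1.0  # placeholder when no non-zero values exist
            params.append((pi, mu, max(var, eps)))  # floor the variance
        model[lab] = params
    return model


def log_likelihood(cell, params):
    """Sum the per-gene mixture log-likelihoods (genes treated separately,
    so no joint distribution over all genes is estimated)."""
    total = 0.0
    for v, (pi, mu, var) in zip(cell, params):
        if v == 0:
            total += math.log(pi)
        else:
            total += (math.log(1 - pi)
                      - 0.5 * math.log(2 * math.pi * var)
                      - (v - mu) ** 2 / (2 * var))
    return total


def annotate(cell, model):
    """Return the cell type whose per-gene models best explain the cell."""
    return max(model, key=lambda lab: log_likelihood(cell, model[lab]))


# Toy data: rows are cells, columns are genes (made-up values).
X = [[0, 2.1, 0], [0, 1.9, 0.5], [3.0, 0, 0], [2.8, 0, 1.0]]
labels = ["T", "T", "B", "B"]
model = fit_gene_mixtures(X, labels)
print(annotate([0, 2.0, 0], model))
```

In this toy example both the pattern of zeros and the level of the non-zero values contribute to the score, which is the sense in which dropout information is used rather than discarded.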
We implemented scAnnotate as an R package and made it publicly available from CRAN: https://cran.r-project.org/package=scAnnotate.
Supplementary data are available at Bioinformatics Advances online.