Cui Tao, Wang Tingting
Department of Pharmacology and Physiology, Georgetown University Medical Center, Washington, DC, 20057, USA.
Interdisciplinary Program in Neuroscience, Georgetown University Medical Center, Washington, DC, 20057, USA.
BMC Genomics. 2021 Jan 11;22(1):47. doi: 10.1186/s12864-020-07302-6.
Single-cell RNA-Sequencing (scRNA-Seq) has provided single-cell level insights into complex biological processes. However, the high frequency of gene expression detection failures (dropout events) in scRNA-Seq data makes it challenging to reliably identify cell types and Differentially Expressed Genes (DEG). Moreover, with the explosive growth of single-cell data generated with the 10x Genomics protocol, existing methods will soon hit computational limits because they do not scale. The single-cell transcriptomics field urgently needs new tools and frameworks to facilitate large-scale single-cell analysis.
To improve the accuracy, robustness, and speed of scRNA-Seq data processing, we propose a generalized zero-inflated negative binomial mixture model, "JOINT," that performs probability-based cell-type discovery and DEG analysis simultaneously, without the need for imputation. JOINT performs soft clustering for cell-type identification by computing, for each individual cell, the probability that it belongs to each cell type, i.e. each cell can belong to multiple cell types with different probabilities. This differs fundamentally from existing hard-clustering methods, in which each cell can belong to only one cell type. The soft-clustering component of the algorithm improves the accuracy and robustness of single-cell analysis, especially when the scRNA-Seq datasets are noisy and contain a large number of dropout events. Moreover, JOINT determines the optimal number of cell types automatically, rather than requiring it to be specified empirically. Fitting the proposed model is an unsupervised learning problem, which is solved with the Expectation-Maximization (EM) algorithm. The EM algorithm is implemented in the TensorFlow deep learning framework, which accelerates data analysis through parallel GPU computing.
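To make the soft-clustering idea concrete, the following is a minimal sketch of the E-step of a zero-inflated negative binomial (ZINB) mixture: given per-cell-type parameters, it returns the probability that each cell belongs to each cell type. It is not the JOINT package's code. The function names (`zinb_logpmf`, `soft_assign`), the mean/dispersion parameterization, the assumption that genes are independent within a cell type, and the toy numbers are all illustrative assumptions, and NumPy/SciPy are used here in place of TensorFlow for brevity.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import nbinom


def zinb_logpmf(x, mu, r, pi):
    """Log-probability of counts x under a ZINB distribution.

    Assumed parameterization (not taken from the paper):
    mu = NB mean, r = NB dispersion, pi = dropout probability in (0, 1).
    """
    p = r / (r + mu)                     # SciPy's success probability
    log_nb = nbinom.logpmf(x, r, p)      # plain NB log-pmf
    # A zero can come from dropout or from the NB component itself.
    log_zero = np.logaddexp(np.log(pi), np.log1p(-pi) + log_nb)
    log_nonzero = np.log1p(-pi) + log_nb
    return np.where(x == 0, log_zero, log_nonzero)


def soft_assign(counts, mu, r, pi, weights):
    """E-step: posterior probability of each cell type for each cell.

    counts: (cells, genes); mu, r, pi: (types, genes); weights: (types,).
    Genes are treated as independent given the cell type (an assumption).
    """
    # Broadcast to (cells, types, genes), then sum log-likelihoods over genes.
    ll = zinb_logpmf(counts[:, None, :], mu[None], r[None], pi[None]).sum(axis=2)
    log_post = np.log(weights)[None, :] + ll
    # Normalize in log space for numerical stability.
    return np.exp(log_post - logsumexp(log_post, axis=1, keepdims=True))


# Toy data: 2 cells, 3 genes, 2 hypothetical cell types.
counts = np.array([[0, 1, 0], [20, 15, 30]])
mu = np.array([[0.5, 0.5, 0.5], [20.0, 20.0, 20.0]])
r = np.full((2, 3), 2.0)
pi = np.full((2, 3), 0.3)
weights = np.array([0.5, 0.5])

resp = soft_assign(counts, mu, r, pi, weights)
print(resp)  # each row is one cell's probabilities over the two cell types
```

In a full EM loop, these probabilities would weight each cell's contribution when the parameters are re-estimated in the M-step, which is why no cell has to be forced into a single cell type and zero counts do not need to be imputed beforehand.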
Taken together, the JOINT algorithm is accurate and efficient for large-scale scRNA-Seq data analysis via parallel computing. The Python package that we have developed can be readily applied to aid future advances in parallel computing-based single-cell algorithms and research in various biological and biomedical fields.