Identifying Subpopulations of Cells in Single-Cell Transcriptomic Data: A Bayesian Mixture Modeling Approach to Zero Inflation of Counts.

Affiliation

Department of Computer Science, University of Surrey, Guildford, United Kingdom.

Publication information

J Comput Biol. 2023 Oct;30(10):1059-1074. doi: 10.1089/cmb.2022.0273.

Abstract

In the study of single-cell RNA-seq (scRNA-Seq) data, a key component of the analysis is to identify subpopulations of cells in the data. A variety of approaches to this have been considered, and although many machine learning-based methods have been developed, these rarely give an estimate of uncertainty in the cluster assignment. To allow for this, probabilistic models have been developed, but scRNA-Seq data exhibit a phenomenon known as dropout, whereby a large proportion of the observed read counts are zero. This poses challenges in developing probabilistic models that appropriately model the data. We develop a novel Dirichlet process mixture model that employs both a mixture at the cell level to model multiple populations of cells and a zero-inflated negative binomial mixture of counts at the transcript level. By taking a Bayesian approach, we are able to model the expression of genes within clusters, and to quantify uncertainty in cluster assignments. It is shown that this approach outperforms previous approaches that applied multinomial distributions to model scRNA-Seq counts and negative binomial models that do not take into account zero inflation. Applied to a publicly available data set of scRNA-Seq counts of multiple cell types from the mouse cortex and hippocampus, we demonstrate how our approach can be used to distinguish subpopulations of cells as clusters in the data, and to identify gene sets that are indicative of membership of a subpopulation.
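To make the two ingredients named in the abstract concrete, the following is a minimal sketch of a zero-inflated negative binomial (ZINB) likelihood and of the cluster membership probabilities it induces for one cell. It is not the authors' implementation: it uses a finite mixture with fixed parameters rather than Dirichlet process inference, and the parameter names (`mu`, `r`, `pi`), the function names, and all numeric values are assumptions chosen for illustration.

```python
import math


def nb_log_pmf(k, mu, r):
    """Log-probability of count k under a negative binomial
    with mean mu > 0 and dispersion r > 0 (assumed parameterisation)."""
    return (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
            + r * math.log(r / (r + mu))
            + k * math.log(mu / (r + mu)))


def zinb_pmf(k, mu, r, pi):
    """ZINB probability of count k: with probability pi the count is a
    structural zero (dropout), otherwise it is drawn from the negative binomial."""
    nb = math.exp(nb_log_pmf(k, mu, r))
    if k == 0:
        return pi + (1.0 - pi) * nb
    return (1.0 - pi) * nb


def cluster_responsibilities(counts, cluster_params, weights):
    """Posterior probability that one cell belongs to each cluster,
    given fixed per-cluster parameters (per-gene means, r, pi) and
    mixture weights. Genes are treated as independent within a cluster."""
    log_post = []
    for (means, r, pi), w in zip(cluster_params, weights):
        log_lik = sum(math.log(zinb_pmf(k, m, r, pi))
                      for k, m in zip(counts, means))
        log_post.append(math.log(w) + log_lik)
    # Normalise in log space to avoid underflow.
    top = max(log_post)
    unnorm = [math.exp(v - top) for v in log_post]
    total = sum(unnorm)
    return [u / total for u in unnorm]


# Hypothetical example: one cell, four genes, two candidate clusters.
cell = [0, 5, 0, 12]
clusters = [
    ([0.5, 6.0, 0.5, 10.0], 2.0, 0.3),  # cluster A: per-gene means, r, pi
    ([8.0, 0.5, 9.0, 0.5], 2.0, 0.3),   # cluster B
]
posterior = cluster_responsibilities(cell, clusters, weights=[0.5, 0.5])
```

Here `posterior` holds one probability per cluster, which is the kind of assignment uncertainty a probabilistic model can report and a hard clustering cannot. The extra mass that `pi` places on zero is what separates the ZINB from a plain negative binomial.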
