Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, 423 Guardian Drive, Philadelphia, PA 19104, USA.
Biostatistics. 2022 Dec 12;24(1):161-176. doi: 10.1093/biostatistics/kxab011.
Single-cell RNA-sequencing (scRNAseq) data contain a high level of noise, especially in the form of zero-inflation, that is, the presence of an excessively large number of zeros. This is largely due to dropout events and amplification biases that occur in the preparation stage of single-cell experiments. Recent scRNAseq experiments have been augmented with unique molecular identifiers (UMI) and External RNA Control Consortium (ERCC) molecules, which can be used to account for zero-inflation. However, most current methods for graphical models assume a multivariate Gaussian distribution or one of its variants, so they cannot adequately account for the excess zeros in scRNAseq data. In this article, we propose a single-cell latent graphical model (scLGM), a Bayesian hierarchical model for estimating the conditional dependency network among genes from scRNAseq data. Taking advantage of UMI and ERCC data, scLGM explicitly models the two sources of zero-inflation. Our simulation study and real data analysis demonstrate that the proposed approach outperforms several existing methods.
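To make the zero-inflation problem concrete, the following is a minimal simulation sketch. It is not the scLGM model. It assumes a hypothetical three-gene latent Gaussian layer with a known precision matrix, Poisson-distributed counts, and a single independent dropout probability; it omits amplification bias and does not use UMI or ERCC information. The function name, parameter values, and precision matrix are all illustrative assumptions.

```python
import numpy as np


def simulate_zero_inflated_counts(n_cells=500, dropout_prob=0.5, seed=0):
    """Simulate counts from a latent Gaussian network, then add dropout zeros.

    Illustrative assumptions only (not taken from the paper): three genes,
    a Poisson count layer, and one shared dropout probability.
    """
    rng = np.random.default_rng(seed)

    # Hypothetical precision matrix: genes 0-1 and 1-2 are conditionally
    # dependent, genes 0-2 are conditionally independent (zero entry).
    precision = np.array([
        [1.0, 0.5, 0.0],
        [0.5, 1.0, 0.5],
        [0.0, 0.5, 1.0],
    ])
    covariance = np.linalg.inv(precision)

    # Latent log-expression for each cell.
    latent = rng.multivariate_normal(np.zeros(3), covariance, size=n_cells)

    # Counts before dropout; some zeros already arise from sampling.
    counts = rng.poisson(np.exp(latent))

    # Dropout: each entry is independently set to zero with dropout_prob.
    kept = rng.random(counts.shape) >= dropout_prob
    observed = counts * kept

    return latent, counts, observed


if __name__ == "__main__":
    latent, counts, observed = simulate_zero_inflated_counts()
    print("zero fraction before dropout:", (counts == 0).mean())
    print("zero fraction after dropout: ", (observed == 0).mean())
```

Comparing the two printed fractions shows how dropout adds zeros beyond those produced by count sampling alone, which is the kind of excess that a purely Gaussian model of the observed data does not represent.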