Detection of differentially expressed genes in discrete single-cell RNA sequencing data using a hurdle model with correlated random effects.

Author information

Sekula Michael, Gaskins Jeremy, Datta Susmita

Affiliations

Department of Bioinformatics and Biostatistics, University of Louisville, Louisville, Kentucky.

Department of Biostatistics, University of Florida, Gainesville, Florida.

Publication information

Biometrics. 2019 Dec;75(4):1051-1062. doi: 10.1111/biom.13074. Epub 2019 Jun 17.

DOI:10.1111/biom.13074

PMID:31009065

Abstract

Single-cell RNA sequencing (scRNA-seq) technologies are revolutionary tools allowing researchers to examine gene expression at the level of a single cell. Traditionally, transcriptomic data have been analyzed from bulk samples, masking the heterogeneity now seen across individual cells. Even within the same cellular population, genes can be highly expressed in some cells but not expressed (or lowly expressed) in others. Therefore, the computational approaches used to analyze bulk RNA sequencing data are not appropriate for the analysis of scRNA-seq data. Here, we present a novel statistical model for high dimensional and zero-inflated scRNA-seq count data to identify differentially expressed (DE) genes across cell types. Correlated random effects are employed based on an initial clustering of cells to capture the cell-to-cell variability within treatment groups. Moreover, this model is flexible and can be easily adapted to an independent random effect structure if needed. We apply our proposed methodology to both simulated and real data and compare results to other popular methods designed for detecting DE genes. Due to the hurdle model's ability to detect differences in the proportion of cells expressed and the average expression level (among the expressed cells), our methods naturally identify some genes as DE that other methods do not, and we demonstrate with real data that these uniquely detected genes are associated with similar biological processes and functions.
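
For orientation, a generic hurdle model for the count $Y_{gc}$ of gene $g$ in cell $c$ can be written as below. This is the textbook two-part form under assumed notation, not necessarily the authors' exact specification; the symbols $\pi_{gc}$, $f$, and $\theta_{gc}$ are introduced here for illustration only.

```latex
\[
P(Y_{gc} = y) =
\begin{cases}
1 - \pi_{gc}, & y = 0,\\[4pt]
\pi_{gc}\,\dfrac{f(y;\theta_{gc})}{1 - f(0;\theta_{gc})}, & y = 1, 2, \dots
\end{cases}
\]
```

Here $\pi_{gc}$ is the probability that the gene is expressed in the cell, and $f$ is a count distribution (for example Poisson or negative binomial) that is truncated at zero for the positive part. Under this form a gene can differ between groups in $\pi_{gc}$, in the mean of the truncated part, or in both, which corresponds to the two kinds of difference the abstract describes.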
