Gan Luqin, Vinci Giuseppe, Allen Genevera I
Rice University.
University of Notre Dame.
ACM BCB. 2020 Sep;2020. doi: 10.1145/3388440.3412462.
Single-cell RNA sequencing (scRNA-seq) is a powerful technique that measures the gene expression of individual cells in a high-throughput fashion. However, sequencing inefficiency makes the data unreliable: it suffers from dropout events, technical artifacts in which genes erroneously appear to have zero expression. Many data imputation methods have been proposed to alleviate this issue. Yet because the data are sparse and high-dimensional, effective imputation is difficult and can be biased, which leads to major distortions in downstream analyses. In this paper, we propose a novel approach that imputes the gene-by-gene correlations rather than the data itself. We call this method SCENA: Single cell RNA-seq Correlation completion by ENsemble learning and Auxiliary information. The SCENA estimate of the gene-by-gene correlation matrix is obtained by model stacking of multiple imputed correlation matrices, guided by known auxiliary information about gene connections. In an extensive simulation study based on real scRNA-seq data, we demonstrate that SCENA not only accurately imputes gene correlations but also outperforms existing imputation approaches in downstream analyses such as dimension reduction, cell clustering, and graphical model estimation.
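The abstract does not spell out how the stacking step is carried out, so the following is only a minimal sketch of the general idea, not the SCENA implementation. It assumes two candidate correlation estimates (plain Pearson correlation on the zero-inflated data, and a pairwise-complete correlation that treats zeros as missing), and it assumes the stacking weights are fit by clipped least squares against a binary matrix of known gene connections. The function names `pairwise_complete_corr` and `stack_correlations`, the simulated data, and the weighting scheme are all hypothetical choices made for illustration.

```python
import numpy as np

def pairwise_complete_corr(X):
    """Gene-by-gene correlation using only cells where both genes are nonzero.

    Zeros are treated as missing (possible dropouts). Pairs with too few
    shared cells, or with no variation, are left at 0.
    """
    X = np.asarray(X, dtype=float)
    n_genes = X.shape[1]
    C = np.eye(n_genes)
    for i in range(n_genes):
        for j in range(i + 1, n_genes):
            both = (X[:, i] != 0) & (X[:, j] != 0)
            if both.sum() >= 3:
                xi, xj = X[both, i], X[both, j]
                if xi.std() > 0 and xj.std() > 0:
                    C[i, j] = C[j, i] = np.corrcoef(xi, xj)[0, 1]
    return C

def stack_correlations(candidates, aux_adjacency):
    """Combine candidate correlation matrices with weights learned from
    auxiliary gene-connection information (assumed scheme, see lead-in).

    Weights come from least squares of the known 0/1 connections on the
    absolute candidate correlations, clipped at zero and normalized to sum
    to one. If every weight is clipped to zero, uniform weights are used.
    """
    iu = np.triu_indices(aux_adjacency.shape[0], k=1)
    A = np.column_stack([np.abs(C[iu]) for C in candidates])
    b = aux_adjacency[iu].astype(float)
    w, *_ = np.linalg.lstsq(A, b, rcond=None)
    w = np.clip(w, 0.0, None)
    if w.sum() > 0:
        w = w / w.sum()
    else:
        w = np.full(len(candidates), 1.0 / len(candidates))
    stacked = sum(wk * C for wk, C in zip(w, candidates))
    return stacked, w

# Simulated toy data (hypothetical): genes 0-2 share a latent factor,
# genes 3-5 are independent, and 30% of entries are dropped to zero.
rng = np.random.default_rng(0)
n_cells, n_genes = 200, 6
latent = rng.normal(size=(n_cells, 1))
loadings = np.array([0.8, 0.8, 0.8, 0.0, 0.0, 0.0])
expr = np.exp(1.0 + latent * loadings + 0.5 * rng.normal(size=(n_cells, n_genes)))
dropout = rng.random((n_cells, n_genes)) < 0.3
X = np.where(dropout, 0.0, expr)

# Auxiliary information: known connections among genes 0-2 only.
aux = np.zeros((n_genes, n_genes), dtype=int)
aux[:3, :3] = 1
np.fill_diagonal(aux, 0)

candidates = [np.corrcoef(X, rowvar=False), pairwise_complete_corr(X)]
stacked, weights = stack_correlations(candidates, aux)
```

Because the weights are non-negative and sum to one, `stacked` is a convex combination of the candidates and keeps a unit diagonal. A pairwise-complete estimate is not guaranteed to be positive semidefinite, so a real pipeline would likely need a projection step before using the result in graphical model estimation.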