Model-based autoencoders for imputing discrete single-cell RNA-seq data.

Affiliations

Department of Computer Science, New Jersey Institute of Technology, Newark, NJ 07102, United States.

NEC Laboratories America, Princeton, NJ 08540, United States.

Publication information

Methods. 2021 Aug;192:112-119. doi: 10.1016/j.ymeth.2020.09.010. Epub 2020 Sep 22.

DOI:10.1016/j.ymeth.2020.09.010

PMID:32971193

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC8592282/

Abstract

Deep neural networks have been widely applied to missing data imputation. However, most existing studies have focused on imputing continuous data, while discrete data imputation remains under-explored. Discrete data is common in the real world, especially in bioinformatics, genetics, and biochemistry. In particular, much recent genomic data consists of discrete counts generated by single-cell RNA sequencing (scRNA-seq) technology. Most scRNA-seq studies produce a discrete matrix with prevalent 'false' zero-count observations (missing values). To make downstream analyses more effective, imputation, which recovers the missing values, is often conducted as the first step in pre-processing scRNA-seq data. In this paper, we propose a novel Zero-Inflated Negative Binomial (ZINB) model-based autoencoder for imputing discrete scRNA-seq data. The novelties of our method are twofold. First, in addition to optimizing the ZINB likelihood, we propose to explicitly model the dropout events that cause missing values by using the Gumbel-Softmax distribution. Second, the zero-inflated reconstruction is further optimized with respect to the raw count matrix. Extensive experiments on simulated datasets demonstrate that the zero-inflated reconstruction significantly improves imputation accuracy. Real data experiments show that the proposed imputation can improve the separation of different cell types and the accuracy of differential expression analysis.
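
The two ingredients named in the abstract, the ZINB likelihood and the Gumbel-Softmax relaxation of a discrete dropout indicator, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the mean/dispersion parameterization (mu, theta), the zero-inflation probability pi, the temperature tau, and the two-class [dropout, keep] layout are all assumptions chosen for illustration, and the autoencoder that would predict these parameters per gene and cell is omitted.

```python
import math
import random


def nb_log_pmf(x, mu, theta):
    """Log-probability of count x under a negative binomial distribution
    with mean mu > 0 and dispersion theta > 0 (assumed parameterization)."""
    return (
        math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
        + theta * math.log(theta / (theta + mu))
        + x * math.log(mu / (theta + mu))
    )


def zinb_log_pmf(x, mu, theta, pi):
    """Log-probability of count x under a zero-inflated negative binomial.

    pi (0 < pi < 1) is the probability of an extra, 'false' zero:
    a zero can come from the inflation component or from the NB itself.
    """
    nb = nb_log_pmf(x, mu, theta)
    if x == 0:
        return math.log(pi + (1.0 - pi) * math.exp(nb))
    return math.log(1.0 - pi) + nb


def gumbel_softmax_sample(logits, tau, rng):
    """Draw one relaxed one-hot sample from the Gumbel-Softmax distribution.

    logits: unnormalized log-probabilities of each class.
    tau:    temperature; smaller values give samples closer to one-hot.
    """
    noisy = []
    for logit in logits:
        u = max(rng.random(), 1e-12)          # avoid log(0)
        gumbel = -math.log(-math.log(u))      # Gumbel(0, 1) noise
        noisy.append((logit + gumbel) / tau)
    top = max(noisy)                          # subtract max for stability
    exps = [math.exp(v - top) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


# Illustrative values only (not taken from the paper).
mu, theta, pi = 2.0, 1.5, 0.3
nll_zero = -zinb_log_pmf(0, mu, theta, pi)    # loss contribution of an observed 0
rng = random.Random(0)
# Hypothetical two-class layout: [dropout, keep], with dropout probability pi.
relaxed_mask = gumbel_softmax_sample([math.log(pi), math.log(1 - pi)], 0.5, rng)
```

In a training loop, the negative of zinb_log_pmf summed over all entries of the count matrix would serve as the reconstruction loss, and the relaxed mask would stand in for a hard 0/1 dropout indicator so that gradients can flow through the sampling step.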
