Badsha Md Bahadur, Li Rui, Liu Boxiang, Li Yang I, Xian Min, Banovich Nicholas E, Fu Audrey Qiuyan
Department of Statistical Science, Institute for Bioinformatics and Evolutionary Studies, Institute for Modeling Collaboration & Innovation, University of Idaho, Moscow, ID 83844, USA.
Department of Biology, Stanford University, Stanford, CA 94305, USA.
Quant Biol. 2020 Mar;8(1):78-94. doi: 10.1007/s40484-019-0192-7. Epub 2020 Jan 22.
Single-cell RNA-sequencing (scRNA-seq) is a rapidly evolving technology that enables measurement of gene expression levels at an unprecedented resolution. Despite the explosive growth in the number of cells that can be assayed in a single experiment, scRNA-seq still has several limitations. One is a high rate of dropouts, which leaves a large number of genes with zero read counts in the scRNA-seq data and complicates downstream analyses.
To overcome this problem, we treat zeros as missing values and develop nonparametric deep learning methods for imputation. Specifically, our LATE (Learning with AuToEncoder) method trains an autoencoder with random initial values of the parameters, whereas our TRANSLATE (TRANSfer learning with LATE) method further allows for the use of a reference gene expression data set to provide LATE with an initial set of parameter estimates.
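The idea in the paragraph above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The single hidden layer, layer sizes, learning rate, full-batch gradient descent, simulated data, and all function names (`init_params`, `train`, `impute`, `simulate`) are assumptions made for illustration. The sketch computes the reconstruction loss only over nonzero entries, so zeros are treated as missing, and contrasts a random start (LATE-style) with a start from parameters pretrained on a reference data set (TRANSLATE-style).

```python
import numpy as np

def init_params(n_genes, n_hidden, rng):
    """Random initial parameters (the LATE-style starting point)."""
    return {
        "W1": rng.normal(0.0, 0.1, (n_genes, n_hidden)),
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(0.0, 0.1, (n_hidden, n_genes)),
        "b2": np.zeros(n_genes),
    }

def forward(params, X):
    """One-hidden-layer autoencoder: ReLU encoder, linear decoder."""
    H = np.maximum(0.0, X @ params["W1"] + params["b1"])
    return H, H @ params["W2"] + params["b2"]

def masked_mse(params, X):
    """Mean squared error over nonzero (observed) entries only."""
    mask = X > 0
    _, Y = forward(params, X)
    return float(np.sum(((Y - X) * mask) ** 2) / max(mask.sum(), 1))

def train(X, params, epochs=500, lr=0.01):
    """Full-batch gradient descent on the masked loss; returns new params."""
    params = {k: v.copy() for k, v in params.items()}
    mask = (X > 0).astype(float)
    n_obs = max(mask.sum(), 1.0)
    for _ in range(epochs):
        H, Y = forward(params, X)
        dY = 2.0 * (Y - X) * mask / n_obs   # zeros contribute no gradient
        dZ = (dY @ params["W2"].T) * (H > 0)  # backprop through ReLU
        params["W2"] -= lr * (H.T @ dY)
        params["b2"] -= lr * dY.sum(axis=0)
        params["W1"] -= lr * (X.T @ dZ)
        params["b1"] -= lr * dZ.sum(axis=0)
    return params

def impute(params, X):
    """Keep observed values; fill zeros with the autoencoder output."""
    _, Y = forward(params, X)
    return np.where(X > 0, X, Y)

def simulate(n_cells, loadings, rng, dropout=0.3):
    """Toy low-rank expression matrix with entries zeroed at random."""
    Z = rng.uniform(0.5, 1.5, (n_cells, loadings.shape[0]))
    keep = rng.random((n_cells, loadings.shape[1])) > dropout
    return (Z @ loadings) * keep

rng = np.random.default_rng(0)
loadings = rng.uniform(0.0, 1.0, (2, 20))   # 2 latent factors, 20 genes
X_ref = simulate(200, loadings, rng)        # hypothetical reference data
X_target = simulate(100, loadings, rng)     # hypothetical data to impute

random_init = init_params(20, 8, rng)
late_params = train(X_target, random_init)        # random start
pretrained = train(X_ref, random_init)            # fit on the reference
translate_params = train(X_target, pretrained)    # warm start, then fine-tune
X_imputed = impute(translate_params, X_target)
```

In this sketch the only difference between the two variants is the parameter dictionary passed to `train`: the transfer-learning variant reuses the weights fitted on the reference matrix instead of a fresh random draw.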
On both simulated and real data, LATE and TRANSLATE outperform existing scRNA-seq imputation methods, achieving lower mean squared error in most cases, recovering nonlinear gene-gene relationships, and better separating cell types. They are also highly scalable and can efficiently process over 1 million cells in just a few hours on a GPU.
We demonstrate that our nonparametric approach to imputation based on autoencoders is powerful and highly efficient.