Chen Liang, Wang Weinan, Zhai Yuyao, Deng Minghua
School of Mathematical Sciences, Peking University, Beijing 100871, China.
Mathematical and Statistical Institute, Northeast Normal University, Changchun 130024, China.
NAR Genom Bioinform. 2020 May 25;2(2):lqaa039. doi: 10.1093/nargab/lqaa039. eCollection 2020 Jun.
Single-cell RNA sequencing (scRNA-seq) allows researchers to study heterogeneity at the level of individual cells. A crucial step in analyzing scRNA-seq data is clustering cells into subpopulations to facilitate downstream analysis. However, frequent dropout events and the growing size of scRNA-seq datasets make it challenging to cluster such high-dimensional, sparse and massive transcriptional expression profiles. Although some existing deep learning-based clustering algorithms for single cells combine dimensionality reduction with clustering, they either ignore the distance and affinity constraints between similar cells or impose additional assumptions on the latent space, such as a Gaussian mixture distribution, and therefore fail to learn a cluster-friendly low-dimensional space. In this paper, we use a deep denoising autoencoder to characterize scRNA-seq data and propose a soft self-training K-means algorithm to cluster the cell population in the learned latent space. The self-training procedure effectively aggregates similar cells and yields a more cluster-friendly latent space. Our method, called 'scziDesk', alternates between data compression, data reconstruction and soft clustering, and the results show excellent compatibility and robustness on both simulated and real data. Moreover, the method scales well with the number of cells on large-scale datasets.
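To make the soft clustering step mentioned in the abstract more concrete, the following is a minimal sketch of plain soft K-means on already-computed latent vectors. It is not the scziDesk implementation: the denoising autoencoder and the self-training objective are omitted, and the function names (`init_centers`, `soft_assign`, `soft_kmeans`), the stiffness parameter `alpha`, the farthest-point initialization and the toy two-blob "latent" data are all assumptions introduced only for illustration.

```python
import numpy as np

def init_centers(z, n_clusters):
    # Deterministic farthest-point initialization (an illustrative choice).
    centers = [z[0]]
    for _ in range(n_clusters - 1):
        # Squared distance from every point to its nearest chosen center.
        d2 = np.min([((z - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(z[d2.argmax()])
    return np.array(centers)

def soft_assign(z, centers, alpha=1.0):
    # Soft membership: softmax over negative squared Euclidean distances.
    d2 = ((z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    logits = -alpha * d2
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)

def soft_kmeans(z, n_clusters, alpha=1.0, n_iter=50):
    centers = init_centers(z, n_clusters)
    for _ in range(n_iter):
        w = soft_assign(z, centers, alpha)
        # Each center becomes the membership-weighted mean of all points.
        centers = (w.T @ z) / w.sum(axis=0)[:, None]
    return centers, soft_assign(z, centers, alpha)

# Toy stand-in for a 2-D latent space: two well-separated groups of "cells".
rng = np.random.default_rng(1)
latent = np.vstack([
    rng.normal(-3.0, 0.5, size=(50, 2)),
    rng.normal(3.0, 0.5, size=(50, 2)),
])
centers, weights = soft_kmeans(latent, n_clusters=2)
labels = weights.argmax(axis=1)  # hard labels derived from soft memberships
```

In a pipeline of the kind the abstract describes, `latent` would instead be the encoder output of the autoencoder, and the clustering loss would be optimized jointly with reconstruction rather than run as a separate post-processing step.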