Duan Hongyu, Li Feng, Shang Junliang, Liu Jinxing, Li Yan, Liu Xikui
School of Computer Science, Qufu Normal University, Rizhao, 276826, China.
Department of Electrical Engineering and Information Technology, Shandong University of Science and Technology, Jinan, 250031, Shandong, China.
Interdiscip Sci. 2022 Dec;14(4):917-928. doi: 10.1007/s12539-022-00536-w. Epub 2022 Aug 8.
Recent advances in single-cell technologies have driven a surge in research. In particular, single-cell Assay for Transposase-Accessible Chromatin with high-throughput sequencing (scATAC-seq) is a popular approach for analyzing differences in chromatin accessibility at the single-cell level, either within or between groups. It is therefore critical for examining cell heterogeneity at an unprecedented resolution and for identifying both known and unknown cell types. However, the ever-increasing number of cells produced by technological development, together with the high noise, sparsity, and dimensionality of the data, makes distinguishing cell types challenging. We propose scVAEBGM, which integrates a Variational Autoencoder (VAE) with a Bayesian Gaussian mixture model (BGM) to process and analyze scATAC-seq data. The method uses the Bayesian Gaussian mixture model to estimate the number of cell types without the number of clusters having to be specified beforehand. In other words, the number of clusters is inferred from the data, which avoids the bias introduced by subjective judgment when it is set manually. Additionally, the method is more robust to noise and represents single-cell data better in lower dimensions. We also devise a further clustering strategy: experiments indicate that clustering again on top of the completed clustering further improves clustering accuracy. On six public datasets, scVAEBGM outperforms various dimension reduction baselines. In downstream applications, scVAEBGM can reveal biological cell types.
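The idea of inferring the number of clusters from the data, rather than fixing it in advance, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes scikit-learn's `BayesianGaussianMixture` as a stand-in for the BGM component, and it replaces the VAE latent representation with a simulated 2-D embedding containing three well-separated groups. The helper names `simulate_latent` and `cluster_latent`, the upper bound of 10 components, and the prior value 0.01 are all illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture


def simulate_latent(n_per_type=100, seed=0):
    """Simulate a 2-D latent embedding with three well-separated groups.

    Assumption: this stands in for the low-dimensional representation
    a VAE would produce; no VAE is trained here.
    """
    rng = np.random.default_rng(seed)
    centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
    points = [c + 0.5 * rng.standard_normal((n_per_type, 2)) for c in centers]
    true_types = np.repeat(np.arange(len(centers)), n_per_type)
    return np.vstack(points), true_types


def cluster_latent(latent, max_components=10, seed=0):
    """Fit a Bayesian Gaussian mixture with more components than needed.

    max_components is only an upper bound: a small weight concentration
    prior pushes the weights of unneeded components toward zero.
    """
    bgm = BayesianGaussianMixture(
        n_components=max_components,
        weight_concentration_prior_type="dirichlet_process",
        weight_concentration_prior=0.01,  # illustrative value
        max_iter=500,
        random_state=seed,
    )
    labels = bgm.fit_predict(latent)
    return labels, bgm.weights_


latent, true_types = simulate_latent()
labels, weights = cluster_latent(latent)

# Number of components that actually received at least one cell.
n_found = len(np.unique(labels))
print("clusters actually used:", n_found)
print("mixture weights:", np.round(weights, 3))
```

In this sketch the user supplies only an upper bound on the number of components. The number of clusters actually used is read off from the fitted labels and mixture weights.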