Department of Computer Science and Technology, College of Computer and Control Engineering, Northeast Forestry University, Harbin 150040, China.
Department of Computer Science and Technology, Faculty of Computing, Harbin Institute of Technology, Harbin 150001, China.
Bioinformatics. 2024 Oct 1;40(10). doi: 10.1093/bioinformatics/btae599.
Cell clustering is foundational for analyzing the heterogeneity of biological tissues using single-cell sequencing data. With the maturation of single-cell multi-omics sequencing technologies, we can integrate multiple omics data to perform cell clustering, thereby overcoming the limited information available from a single omics layer. Existing cell clustering methods often consider only the differences in data patterns when analyzing multi-omics data, but the dependencies between omics features of different cell types also significantly influence cell clustering. Moreover, the high dropout rates in scRNA-seq and scATAC-seq data can degrade cell clustering performance.
We propose scDRMAE, a cell clustering model based on a masked autoencoder. Using a masking mechanism, scDRMAE learns the relationships between different features and imputes the false zeros caused by dropout events. To differentiate the importance of the various omics data in cell clustering, we dynamically adjust the weight of each omics layer through an attention mechanism. Finally, we apply the K-means algorithm to the fused multi-omics data. On 15 commonly used multi-omics datasets, our method demonstrates superior cell clustering performance on multiple metrics compared to other computational methods. In addition, when datasets exhibit varying degrees of dropout noise, our method shows better performance and stronger stability on multiple metrics than other methods. Moreover, by analyzing the cell clusters identified by scDRMAE, we found several biologically significant biomarkers that have been validated, further confirming the effectiveness of scDRMAE in cell clustering from a biological perspective.
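The three steps named above (masking, attention-weighted fusion, K-means) can be illustrated with a minimal NumPy sketch. This is not the scDRMAE implementation: the abstract does not specify the network architecture, loss, or attention form, so every function name and parameter here (`mask_features`, `attention_fuse`, `kmeans`, `mask_ratio`, the fixed `scores`) is a hypothetical stand-in. In particular, the attention scores are fixed constants rather than learned, toy matrices stand in for the encoder outputs, and no autoencoder is trained.

```python
import numpy as np


def mask_features(x, mask_ratio, rng):
    """Zero out a random subset of entries (assumed masking scheme).

    Returns the masked matrix and the boolean mask, so that a model could
    be trained to reconstruct the hidden entries.
    """
    mask = rng.random(x.shape) < mask_ratio
    return np.where(mask, 0.0, x), mask


def attention_fuse(embeddings, scores):
    """Softmax the per-omics scores and return the weighted sum of embeddings.

    In a real model the scores would be learned; here they are given.
    """
    scores = np.asarray(scores, dtype=float)
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    fused = sum(w * e for w, e in zip(weights, embeddings))
    return fused, weights


def kmeans(x, k, rng, n_iter=50):
    """Plain Lloyd's K-means with farthest-first initialization (a simple choice)."""
    centers = [x[rng.integers(len(x))]]
    while len(centers) < k:
        # Next center: the point farthest from all centers chosen so far.
        d = np.min([((x - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(x[d.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        updated = np.array([
            x[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(updated, centers):
            break
        centers = updated
    return labels, centers


# Toy data: 40 cells, two groups, 4-dimensional "embeddings" per omics layer.
rng = np.random.default_rng(0)
rna = np.vstack([rng.normal(0, 0.1, (20, 4)), rng.normal(5, 0.1, (20, 4))])
atac = np.vstack([rng.normal(0, 0.1, (20, 4)), rng.normal(5, 0.1, (20, 4))])

masked_rna, rna_mask = mask_features(rna, mask_ratio=0.3, rng=rng)
fused, weights = attention_fuse([rna, atac], scores=[1.0, 0.5])
labels, centers = kmeans(fused, k=2, rng=rng)
```

In this toy setup the weights sum to one, so the fused matrix stays on the same scale as its inputs, and K-means then operates on a single matrix regardless of how many omics layers were combined.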