School of Computer Science and Engineering, Central South University, 932 South Lushan Road, Yuelu District, Changsha 410083, China.
Bioinformatics. 2024 Jan 2;40(1). doi: 10.1093/bioinformatics/btae020.
Single-cell RNA sequencing has emerged as a powerful technology for studying gene expression at the individual cell level. Clustering individual cells into distinct subpopulations is fundamental in scRNA-seq data analysis, facilitating the identification of cell types and exploration of cellular heterogeneity. Despite the recent development of many deep learning-based single-cell clustering methods, few have effectively exploited the correlations among genes, resulting in suboptimal clustering outcomes.
Here, we propose a novel masked autoencoder-based method, scMAE, for cell clustering. scMAE perturbs gene expression and employs a masked autoencoder to reconstruct the original data, learning robust and informative cell representations. The masked autoencoder introduces a masking predictor, which captures relationships among genes by predicting whether gene expression values are masked. By integrating this masking mechanism, scMAE effectively captures latent structures and dependencies in the data, enhancing clustering performance. We conducted extensive comparative experiments using various clustering evaluation metrics on 15 scRNA-seq datasets from different sequencing platforms. Experimental results indicate that scMAE outperforms other state-of-the-art methods on these datasets. In addition, scMAE accurately identifies rare cell types, which are challenging to detect due to their low abundance. Furthermore, biological analyses confirm the biological significance of the identified cell subpopulations.
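The perturb-then-reconstruct idea described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation (that is in the released source code). The function names, the masking rate, the rule of replacing a masked entry with the same gene's value from a randomly chosen cell, the single-layer linear encoder and heads, and the specific loss functions are all assumptions made for illustration. The sketch performs only a forward pass: it corrupts the expression matrix, predicts which entries were masked, reconstructs the original values, and returns the two loss terms.

```python
import numpy as np


def perturb_expression(x, mask_rate, rng):
    """Corrupt a cells-by-genes matrix; return (corrupted matrix, binary mask)."""
    mask = rng.random(x.shape) < mask_rate
    # Assumption: a masked entry is replaced by the same gene's value
    # taken from a randomly chosen cell (the gene column is preserved).
    donor_rows = rng.integers(0, x.shape[0], size=x.shape)
    donors = np.take_along_axis(x, donor_rows, axis=0)
    return np.where(mask, donors, x), mask.astype(np.float64)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def forward_losses(x, x_tilde, mask, params):
    """One forward pass of a toy masked autoencoder (hypothetical architecture)."""
    w_enc, w_mask, w_dec = params
    z = np.tanh(x_tilde @ w_enc)        # cell embedding from the corrupted input
    mask_prob = sigmoid(z @ w_mask)     # per-gene probability of having been masked
    recon = z @ w_dec                   # reconstruction of the original expression
    p = np.clip(mask_prob, 1e-7, 1.0 - 1e-7)
    # Binary cross-entropy for the masking predictor.
    mask_loss = -np.mean(mask * np.log(p) + (1.0 - mask) * np.log(1.0 - p))
    # Mean squared error against the uncorrupted data.
    recon_loss = np.mean((recon - x) ** 2)
    return z, mask_loss, recon_loss


rng = np.random.default_rng(0)
n_cells, n_genes, n_latent = 8, 20, 4
# Synthetic log-transformed counts standing in for a real scRNA-seq matrix.
x = np.log1p(rng.poisson(2.0, size=(n_cells, n_genes)).astype(np.float64))
x_tilde, mask = perturb_expression(x, mask_rate=0.3, rng=rng)
params = (
    rng.normal(scale=0.1, size=(n_genes, n_latent)),
    rng.normal(scale=0.1, size=(n_latent, n_genes)),
    rng.normal(scale=0.1, size=(n_latent, n_genes)),
)
z, mask_loss, recon_loss = forward_losses(x, x_tilde, mask, params)
print(z.shape, float(mask_loss), float(recon_loss))
```

In a trained model, the two losses would be minimized jointly and the embedding `z` passed to a standard clustering algorithm; both steps are omitted here.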
The source code of scMAE is available at: https://zenodo.org/records/10465991.