Li Bo, Zhao Yongkang, Hu Jing, Zhang Shihua, Zhang Xiaolong
School of Computer Science and Technology, Wuhan University of Science and Technology, Huangjiahu west road 2#, Wuhan 430065, China.
Hubei Province Key Laboratory of Intelligent Information Processing and Real-time Industrial System, Wuhan University of Science and Technology, Huangjiahu west road 2#, Wuhan 430065, China.
Brief Bioinform. 2025 Mar 4;26(2). doi: 10.1093/bib/bbaf128.
Single-cell sequencing technology enables researchers to study cellular heterogeneity at the resolution of individual cells. Clustering single-cell data into subgroups is an essential step that facilitates downstream analysis. However, the high dimensionality, sparsity, and dropout events of the data make clustering challenging. Many deep learning methods have been proposed, but they either fail to fully utilize pairwise distance information between similar cells or do not adequately capture feature correlations, and they cannot effectively handle high-dimensional sparse data. As a result, they are not suitable for high-fidelity clustering, which makes it difficult to resolve the clearly separated cell types required for downstream analysis. The proposed scSAMAC method integrates contrastive learning and negative binomial losses into a variational autoencoder, extracting features via contrastive unit similarity while preserving the intrinsic characteristics of the data. This enhances robustness and generalization during clustering. For contrastive learning, it constructs a mask module that adopts a negative sample generation method with gene feature saliency adjustment, which selects features that are more influential in the clustering phase and simulates missing-data events. Additionally, it introduces a novel loss consisting of a soft k-means loss, a Wasserstein distance, and a contrastive loss, which makes fuller use of the information in the data and improves clustering performance. Furthermore, a multi-head attention module is applied to the latent variables at each layer of the autoencoder to enhance feature correlation, integration, and information repair. Experimental results demonstrate that scSAMAC outperforms several state-of-the-art clustering methods.
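The abstract names three ingredients without giving formulas: a negative binomial loss, a soft k-means loss, and a mask that simulates missing-data events. The following is a minimal sketch of textbook forms of each, not the scSAMAC implementation. The function names, the mean/dispersion parameterization, the `alpha` temperature, and the per-gene masking probabilities (a stand-in for the saliency adjustment, which the abstract does not specify) are all assumptions made for illustration.

```python
import math
import random


def nb_nll(x, mu, theta, eps=1e-8):
    """Negative log-likelihood of a count x under a negative binomial
    with mean mu and dispersion theta (assumed parameterization)."""
    log_p = (
        math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1.0)
        + theta * math.log((theta + eps) / (theta + mu + eps))
        + x * math.log((mu + eps) / (theta + mu + eps))
    )
    return -log_p


def soft_kmeans_loss(z, centers, alpha=1.0):
    """Mean over cells of softmax-weighted squared distances to the centers.
    alpha is a hypothetical temperature; larger values approach hard k-means."""
    total = 0.0
    for point in z:
        d = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centers]
        m = min(d)  # shift by the minimum distance for numerical stability
        w = [math.exp(-alpha * (dk - m)) for dk in d]
        s = sum(w)
        total += sum(wk / s * dk for wk, dk in zip(w, d))
    return total / len(z)


def mask_genes(cell, gene_mask_prob, rng):
    """Zero each gene with its own probability to mimic dropout events.
    gene_mask_prob is a hypothetical stand-in for saliency-adjusted rates."""
    return [0.0 if rng.random() < p else v for v, p in zip(cell, gene_mask_prob)]


if __name__ == "__main__":
    rng = random.Random(0)
    cell = [3.0, 0.0, 12.0, 1.0]
    print(mask_genes(cell, [0.1, 0.1, 0.5, 0.9], rng))
    print(sum(nb_nll(x, mu=2.0, theta=3.0) for x in cell))
    print(soft_kmeans_loss([[0.0, 0.1], [4.9, 5.0]], [[0.0, 0.0], [5.0, 5.0]]))
```

In a full model these terms would be combined with the Wasserstein and contrastive terms as a weighted sum; the weights and the exact form of each term are not given in the abstract.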