Xu Ping, Wang Pengfei, Ning Zhiyuan, Xiao Meng, Wu Min, Zhou Yuanchun
Computer Network Information Center, Chinese Academy of Sciences, Beijing, 100083, China.
University of Chinese Academy of Sciences, Beijing, 100864, China.
BMC Bioinformatics. 2025 Jul 25;26(1):195. doi: 10.1186/s12859-025-06231-z.
Clustering analysis is fundamental in single-cell RNA sequencing (scRNA-seq) data analysis for elucidating cellular heterogeneity and diversity. Recent graph-based scRNA-seq clustering methods, particularly graph neural networks (GNNs), have made significant progress in tackling high dimensionality, high sparsity, and frequent dropout events, which blur the boundaries between cell populations. However, a major limitation of GNN-based methods is their reliance on hard graph constructions derived from similarity matrices. These constructions are problematic for scRNA-seq data for two reasons: (i) thresholding reduces intercellular relationships to binary edges (0 or 1), which prevents the graph from capturing continuous similarities among cells and leads to significant information loss; and (ii) hard graphs contain a substantial number of inter-cluster connections, which can mislead GNN methods that rely heavily on graph structure, potentially causing erroneous message propagation and biased clustering outcomes.
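The contrast between a thresholded (hard) graph and a weighted (soft) graph can be illustrated with a minimal sketch. It assumes cosine similarity and a fixed threshold of 0.5 on a toy expression matrix; the function names, the similarity measure, and the threshold are illustrative choices, not the construction used by scSGC.

```python
import numpy as np

def cosine_similarity(X):
    """Pairwise cosine similarity between the rows of X (cells x genes)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.clip(norms, 1e-12, None)  # guard against all-zero cells
    return Xn @ Xn.T

def hard_graph(S, threshold):
    """Binary adjacency: 1 where similarity exceeds the threshold, else 0."""
    A = (S > threshold).astype(float)
    np.fill_diagonal(A, 0.0)  # no self-loops
    return A

def soft_graph(S):
    """Weighted adjacency: the non-negative similarity itself is the edge weight."""
    A = np.clip(S, 0.0, None)
    np.fill_diagonal(A, 0.0)  # no self-loops
    return A

# Toy expression matrix, 4 cells x 3 genes (made-up values for illustration).
X = np.array([
    [5.0, 0.0, 1.0],
    [4.0, 0.0, 2.0],
    [0.0, 3.0, 1.0],
    [0.0, 4.0, 0.0],
])

S = cosine_similarity(X)
A_hard = hard_graph(S, threshold=0.5)  # threshold is an arbitrary assumption
A_soft = soft_graph(S)

print(np.round(A_hard, 3))
print(np.round(A_soft, 3))
```

In the hard graph, the weak but non-zero similarity between cells 0 and 2 is discarded entirely, while the soft graph keeps it as a small edge weight alongside the strong within-pair edges.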
To tackle these challenges, we introduce scSGC, a soft graph clustering method for single-cell RNA sequencing data, which characterizes continuous similarities among cells more accurately through non-binary edge weights, thereby mitigating the limitations of rigid data structures. The scSGC framework comprises three core components: (i) a zero-inflated negative binomial (ZINB)-based feature autoencoder designed to handle the sparsity and dropout issues in scRNA-seq data; (ii) a dual-channel cut-informed soft graph embedding module, constructed from deep graph-cut information, which captures continuous similarities between cells while preserving the intrinsic data structures of scRNA-seq; and (iii) an optimal transport-based clustering optimization module, which delineates cell populations optimally while maintaining high biological relevance.
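For component (i), the standard ZINB likelihood mixes a point mass at zero (weight π) with a negative binomial distribution of mean μ and dispersion θ. The sketch below evaluates the negative log-likelihood of a single count under that standard parameterization. The function names are hypothetical, and in an autoencoder setting μ, θ, and π would be decoder outputs per cell and gene; how scSGC actually parameterizes and trains this loss is not stated in the abstract.

```python
import math

def nb_log_pmf(x, mu, theta):
    """Log-probability of count x under a negative binomial with mean mu
    and dispersion theta (variance = mu + mu**2 / theta)."""
    return (
        math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
        + theta * math.log(theta / (theta + mu))
        + x * math.log(mu / (theta + mu))
    )

def zinb_nll(x, mu, theta, pi):
    """Negative log-likelihood of count x under a ZINB model.

    pi is the probability that the observation is a structural zero
    (dropout); with probability 1 - pi the count comes from the NB part.
    """
    nb = math.exp(nb_log_pmf(x, mu, theta))
    if x == 0:
        likelihood = pi + (1.0 - pi) * nb  # a zero can come from either part
    else:
        likelihood = (1.0 - pi) * nb       # non-zero counts only from NB
    return -math.log(likelihood)

# Illustrative values only: a zero count is less surprising once
# zero inflation is allowed.
nll_zero_without_inflation = zinb_nll(0, mu=2.0, theta=1.5, pi=0.0)
nll_zero_with_inflation = zinb_nll(0, mu=2.0, theta=1.5, pi=0.5)
print(nll_zero_without_inflation, nll_zero_with_inflation)
```

This scalar form is written for readability; a training loss would compute the same quantity in a vectorized, numerically stabilized form and average it over all cells and genes.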
By integrating dual-channel cut-informed soft graph representation learning, a ZINB-based feature autoencoder, and optimal transport-driven clustering optimization, scSGC effectively overcomes the challenges associated with traditional hard graph constructions in GNN methods. Extensive experiments across ten datasets demonstrate that scSGC outperforms 13 state-of-the-art clustering models in clustering accuracy, cell type annotation, and computational efficiency. These results highlight its substantial potential to advance scRNA-seq data analysis and deepen our understanding of cellular heterogeneity.
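The abstract does not specify how the optimal transport step is formulated. As a generic illustration of the idea, the sketch below uses entropy-regularized optimal transport (Sinkhorn iterations) to turn a cell-by-cluster score matrix into soft assignments whose cluster masses are balanced. The uniform cluster-size prior, the regularization strength `epsilon`, and the function name `sinkhorn_assignments` are all assumptions made for this example, not details of scSGC.

```python
import numpy as np

def sinkhorn_assignments(scores, epsilon=0.1, n_iters=200):
    """Soft cluster assignments via entropy-regularized optimal transport.

    scores: (n_cells, n_clusters) array, higher means a better match.
    Returns an array of the same shape whose columns each sum to
    n_cells / n_clusters and whose rows sum to about 1 after convergence.
    """
    n, k = scores.shape
    K = np.exp((scores - scores.max()) / epsilon)  # Gibbs kernel, shifted for stability
    r = np.full(n, 1.0 / n)  # every cell carries equal mass
    c = np.full(k, 1.0 / k)  # assumption: clusters receive equal total mass
    u = np.ones(n)
    v = np.ones(k)
    for _ in range(n_iters):
        u = r / (K @ v)      # rescale rows to match the cell marginal
        v = c / (K.T @ u)    # rescale columns to match the cluster marginal
    P = u[:, None] * K * v[None, :]  # transport plan with total mass 1
    return P * n                     # rescale so each row sums to about 1

# Toy scores for 6 cells and 2 clusters (made-up values for illustration).
scores = np.array([
    [0.9, 0.1],
    [0.8, 0.2],
    [0.7, 0.3],
    [0.2, 0.8],
    [0.1, 0.9],
    [0.3, 0.7],
])

Q = sinkhorn_assignments(scores)
labels = Q.argmax(axis=1)
print(np.round(Q, 3))
print(labels)
```

The balancing constraint is what distinguishes this from a plain softmax over the scores: it discourages degenerate solutions in which most cells collapse into a single cluster. A uniform prior is only one possible choice, and real cell populations are often unbalanced.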