nsDCC: dual-level contrastive clustering with nonuniform sampling for scRNA-seq data analysis.

Affiliations

School of Computer Science and Engineering, No. 195 Chuangxin Road, Hunnan District, Northeastern University, Shenyang 110819, China.

Key Laboratory of Intelligent Computing in Medical Image (MIIC), Northeastern University, No. 195 Chuangxin Road, Hunnan District, Shenyang 110000, China.

Publication information

Brief Bioinform. 2024 Sep 23;25(6). doi: 10.1093/bib/bbae477.

DOI:10.1093/bib/bbae477

PMID:39327063

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11427072/

Abstract

Dimensionality reduction and clustering are crucial tasks in single-cell RNA sequencing (scRNA-seq) data analysis, but current pipelines treat them independently, which prevents each from benefiting the other. Recent methods jointly optimize the two tasks through deep clustering. However, common deep clustering methods require pre-defined cluster centers, a limitation that contrastive learning, with its powerful representation capability, can overcome. Therefore, a dual-level contrastive clustering method with nonuniform sampling (nsDCC) is proposed for scRNA-seq data analysis. Dual-level contrastive clustering, which combines instance-level contrast and cluster-level contrast, jointly optimizes dimensionality reduction and clustering. Multi-positive contrastive learning and a unit matrix constraint are introduced in the instance-level and cluster-level contrast, respectively. Furthermore, an attention mechanism is introduced to capture inter-cellular information, which is beneficial for clustering. Using the proposed nearest boundary sparsest density weight assignment algorithm, nsDCC focuses on important samples at category boundaries and in minority categories, which allows it to capture comprehensive characteristics of imbalanced datasets. Experimental results show that nsDCC outperforms six other state-of-the-art methods on both real and simulated scRNA-seq data, validating its performance on dimensionality reduction and clustering of scRNA-seq data, especially for imbalanced data. Simulation experiments demonstrate that nsDCC is insensitive to "dropout events" in scRNA-seq data. Finally, differentially expressed gene analysis of the clusters confirms that the results from nsDCC are meaningful. In summary, nsDCC offers a new way of analyzing and understanding scRNA-seq data.
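To make the idea of combining instance-level and cluster-level contrast concrete, the following is a minimal NumPy sketch of a generic two-view dual-level contrastive loss. It is not the nsDCC implementation: it uses a single positive per sample (not the multi-positive variant), and it omits the unit matrix constraint, the attention mechanism, and the nonuniform sample weighting. The function names (`nt_xent`, `dual_level_loss`), the temperatures, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-12)


def softmax(x):
    """Row-wise softmax, used here only to build toy soft cluster assignments."""
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)


def nt_xent(a, b, temperature):
    """Symmetric NT-Xent loss: row i of `a` and row i of `b` form a positive pair,
    and every other row in either view acts as a negative."""
    a, b = l2_normalize(a), l2_normalize(b)
    n = a.shape[0]
    z = np.concatenate([a, b], axis=0)              # (2n, d)
    sim = (z @ z.T) / temperature                   # scaled cosine similarities
    np.fill_diagonal(sim, -np.inf)                  # a row is never its own negative
    positive = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    row_max = sim.max(axis=1, keepdims=True)        # stabilise the log-sum-exp
    log_denominator = row_max[:, 0] + np.log(np.exp(sim - row_max).sum(axis=1))
    return float(np.mean(log_denominator - sim[np.arange(2 * n), positive]))


def dual_level_loss(h_a, h_b, p_a, p_b, tau_instance=0.5, tau_cluster=1.0):
    """Hypothetical combined objective.

    h_a, h_b: (cells, features) embeddings of two augmented views.
    p_a, p_b: (cells, clusters) soft cluster assignments of the same views.
    """
    # Instance level: rows (cells) are the objects being contrasted.
    instance = nt_xent(h_a, h_b, tau_instance)
    # Cluster level: columns (clusters) are the objects being contrasted.
    cluster = nt_xent(p_a.T, p_b.T, tau_cluster)
    return instance + cluster, instance, cluster


# Toy usage with synthetic data standing in for two augmented views of 64 cells.
rng = np.random.default_rng(0)
cells = rng.normal(size=(64, 16))
h_a = cells + 0.05 * rng.normal(size=cells.shape)
h_b = cells + 0.05 * rng.normal(size=cells.shape)
logits = 5.0 * rng.normal(size=(64, 4))
p_a = softmax(logits + 0.1 * rng.normal(size=logits.shape))
p_b = softmax(logits + 0.1 * rng.normal(size=logits.shape))

total, instance, cluster = dual_level_loss(h_a, h_b, p_a, p_b)
print(f"instance={instance:.3f} cluster={cluster:.3f} total={total:.3f}")
```

In this sketch the same loss function serves both levels: transposing the assignment matrix turns clusters into the objects being contrasted, so no cluster centers need to be defined in advance.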
