School of Mathematics and Physics, Wuhan Institute of Technology, Wuhan 430205, China.
Peng Cheng Laboratory, and School of Microelectronics, Southern University of Science and Technology, Shenzhen 518055, China.
Brief Bioinform. 2024 Sep 23;25(6). doi: 10.1093/bib/bbae572.
Single-cell RNA sequencing (scRNA-seq) technology is one of the most cost-effective and efficacious methods for revealing cellular heterogeneity and diversity. Precise identification of cell types is essential for establishing a robust foundation for downstream analyses and is a prerequisite for understanding heterogeneous mechanisms. However, the accuracy of existing methods needs improvement, and highly accurate methods often impose stringent equipment requirements. Moreover, most unsupervised learning-based approaches require the number of cell types to be supplied a priori, which limits their widespread application. In this paper, we propose a novel algorithmic framework named WLGG. First, to capture the underlying nonlinear information, we introduce a weighted distance penalty term based on the Gaussian kernel function, which maps data from a low-dimensional nonlinear space to a high-dimensional linear space. We then impose a Lasso constraint on the regularized Gaussian graphical model to enhance its ability to capture linear data characteristics. Finally, we use the eigengap strategy to predict the number of cell types and obtain predicted labels via spectral clustering. Experimental results on 14 test datasets demonstrate the superior clustering accuracy of the WLGG algorithm over 16 alternative methods. Furthermore, downstream analyses based on the similarity matrix and predicted labels from the WLGG algorithm, including marker gene identification, pseudotime inference, and functional enrichment analysis, substantiate the reliability of WLGG and offer valuable insights into dynamic biological processes and regulatory mechanisms.
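The eigengap step mentioned in the abstract can be illustrated with a minimal sketch. This is not the WLGG implementation: it assumes a plain Gaussian kernel similarity as a stand-in for the similarity matrix that WLGG learns, and the function names `gaussian_similarity` and `eigengap_num_clusters`, the bandwidth `sigma`, and the cap `max_k` are hypothetical choices made for illustration only.

```python
import numpy as np


def gaussian_similarity(X, sigma=1.0):
    """Gaussian kernel similarity between rows of X (cells x features).

    Assumption: a stand-in for a learned similarity matrix, not WLGG's own.
    """
    sq = np.sum(X ** 2, axis=1)
    # Squared Euclidean distances, clipped at zero against round-off.
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)
    S = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(S, 0.0)  # no self-loops in the similarity graph
    return S


def eigengap_num_clusters(S, max_k=10):
    """Estimate the number of clusters from the largest eigengap.

    Uses the symmetric normalized Laplacian L = I - D^{-1/2} S D^{-1/2}.
    """
    deg = S.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))
    L = np.eye(S.shape[0]) - d_inv_sqrt[:, None] * S * d_inv_sqrt[None, :]
    # eigvalsh returns eigenvalues in ascending order.
    eigvals = np.linalg.eigvalsh(L)[: max_k + 1]
    gaps = np.diff(eigvals)
    # A large gap after the k-th smallest eigenvalue suggests k clusters.
    return int(np.argmax(gaps)) + 1


if __name__ == "__main__":
    # Toy data: three well-separated groups of 20 points each.
    rng = np.random.default_rng(0)
    centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
    X = np.vstack([c + 0.3 * rng.standard_normal((20, 2)) for c in centers])
    print(eigengap_num_clusters(gaussian_similarity(X)))
```

Once the number of clusters is estimated this way, labels would typically be obtained by running k-means on the eigenvectors belonging to the smallest eigenvalues, which is the usual spectral clustering recipe.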