School of Computer Science, Qufu Normal University, Rizhao 276826, China.
Comput Methods Biomech Biomed Engin. 2024 Jan-Mar;27(4):498-511. doi: 10.1080/10255842.2023.2188106. Epub 2023 Mar 13.
The development and widespread use of high-throughput sequencing technologies in biology have fueled the rapid growth of single-cell RNA sequencing (scRNA-seq) data over the past decade, and scRNA-seq has significantly expanded researchers' understanding of cellular heterogeneity. Accurate cell type identification is a prerequisite for any research on heterogeneous cell populations. However, because scRNA-seq data are noisy and high-dimensional, improving the accuracy of cell type identification remains a challenge. As an effective dimensionality reduction method, Principal Component Analysis (PCA) is an essential tool for visualizing high-dimensional scRNA-seq data and identifying cell subpopulations. However, traditional PCA has difficulty capturing the nonlinear manifold structure of the data, and its principal components (PCs) are usually dense. Therefore, we present a novel method in this paper called joint Lp-norm and random walk graph constrained PCA (RWPPCA). RWPPCA aims to retain the local information of the data when mapping high-dimensional data to a low-dimensional space, to obtain sparse principal components more accurately, and thereby to identify cell types more precisely. Specifically, RWPPCA combines the random walk (RW) algorithm with graph regularization to more accurately determine the local geometric relationships between data points. Moreover, to mitigate the adverse effects of dense PCs, the Lp-norm is introduced to make the PCs sparser, thus increasing their interpretability. We evaluate the effectiveness of RWPPCA on simulated data and scRNA-seq data. The results show that RWPPCA performs well in cell type identification and outperforms the other methods compared.
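The abstract does not give the RWPPCA objective or its solver, so the following is only a minimal sketch of the general idea of combining a random walk with graph-regularized PCA. It is not the authors' implementation. Everything in it is an assumption made for illustration: the function names, the 0/1 k-nearest-neighbour graph, the random-walk-with-restart refinement, the parameters `k`, `restart` and `alpha`, and the closed-form eigenvector solution borrowed from standard graph-Laplacian PCA. The sparsity-inducing norm penalty on the PCs that the abstract describes is deliberately omitted, since it would require an iterative solver the abstract does not specify.

```python
import numpy as np

def knn_affinity(X, k=5):
    """Symmetric 0/1 k-nearest-neighbour graph; rows of X are cells (assumed layout)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)   # squared Euclidean distances
    np.fill_diagonal(d2, np.inf)                       # a cell is not its own neighbour
    idx = np.argsort(d2, axis=1)[:, :k]                # k nearest cells per row
    n = X.shape[0]
    W = np.zeros((n, n))
    W[np.repeat(np.arange(n), k), idx.ravel()] = 1.0
    return np.maximum(W, W.T)                          # symmetrise

def random_walk_affinity(W, restart=0.5):
    """Refine W with a random walk with restart (closed-form steady state)."""
    P = W / W.sum(axis=1, keepdims=True)               # row-stochastic transition matrix
    n = W.shape[0]
    S = restart * np.linalg.inv(np.eye(n) - (1.0 - restart) * P)
    return 0.5 * (S + S.T)                             # symmetric, non-negative affinity

def graph_regularized_pca(X, S, n_components=2, alpha=1.0):
    """Graph-Laplacian PCA: min ||Xc - V U^T||^2 + alpha * tr(V^T L V), V^T V = I."""
    Xc = X - X.mean(axis=0)                            # centre each gene
    L = np.diag(S.sum(axis=1)) - S                     # graph Laplacian of the affinity
    G = -(Xc @ Xc.T) + alpha * L                       # cells x cells, symmetric
    _, vecs = np.linalg.eigh(G)                        # eigenvalues in ascending order
    V = vecs[:, :n_components]                         # low-dimensional cell embedding
    U = Xc.T @ V                                       # (dense) gene loadings
    return V, U

# Toy usage: two synthetic "cell types" with 20 cells and 5 genes each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (20, 5)), rng.normal(10.0, 1.0, (20, 5))])
S = random_walk_affinity(knn_affinity(X, k=5), restart=0.5)
V, U = graph_regularized_pca(X, S, n_components=2, alpha=1.0)
```

In this sketch the embedding `V` could then be passed to any clustering routine to assign cell types. The loadings `U` remain dense here; adding a sparsity penalty on them is the part of the method this sketch leaves out.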