Department of Mathematics, Michigan State University, East Lansing, MI 48824, USA.
Department of Mathematics, Michigan State University, East Lansing, MI 48824, USA; Department of Electrical and Computer Engineering, Michigan State University, East Lansing, MI 48824, USA; Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI 48824, USA.
Comput Biol Med. 2024 Jun;175:108497. doi: 10.1016/j.compbiomed.2024.108497. Epub 2024 Apr 24.
Single-cell RNA sequencing (scRNA-seq) is widely used to reveal heterogeneity in cells, which has given us insights into cell-cell communication, cell differentiation, and differential gene expression. However, analyzing scRNA-seq data is challenging because the data are sparse and involve a large number of genes. Dimensionality reduction and feature selection are therefore important for removing spurious signals and enhancing downstream analysis. Traditional principal component analysis (PCA), a main workhorse in dimensionality reduction, cannot capture the geometrical structure embedded in the data, and previous graph Laplacian regularizations are limited to a single scale of analysis. We propose a topological Principal Components Analysis (tPCA) method that combines the persistent Laplacian (PL) technique with L2,1-norm regularization to address multiscale and multiclass heterogeneity in data. We further introduce a k-Nearest-Neighbor (kNN) persistent Laplacian technique to improve the robustness of our persistent Laplacian method. The proposed kNN-PL is a new algebraic topology technique that addresses many limitations of traditional persistent homology. Rather than inducing a filtration by varying a distance threshold, we introduce kNN-tPCA, in which the filtration is obtained by varying the number of neighbors in a kNN network at each step, and we find that this framework has significant implications for hyper-parameter tuning. We validate the efficacy of the proposed tPCA and kNN-tPCA methods on 11 diverse benchmark scRNA-seq datasets, and show that our methods outperform other unsupervised PCA enhancements from the literature, as well as the popular Uniform Manifold Approximation and Projection (UMAP), t-Distributed Stochastic Neighbor Embedding (tSNE), and Non-Negative Matrix Factorization (NMF), by significant margins.
For example, tPCA provides up to 628%, 78%, and 149% improvements over UMAP, tSNE, and NMF, respectively, on classification as measured by the F1 score, and kNN-tPCA offers 53%, 63%, and 32% improvements over UMAP, tSNE, and NMF, respectively, on clustering as measured by the adjusted Rand index (ARI).
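The idea of a kNN-induced filtration can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the symmetrization rule (an edge exists if either point lists the other among its k nearest neighbors), the toy points, and the weights used to combine scales are all assumptions made for illustration. The sketch builds one graph Laplacian per value of k, so that each graph contains the previous one, and then forms a weighted sum of those Laplacians as one plausible multiscale regularizer. It uses only the Python standard library.

```python
import math


def pairwise_distances(points):
    """Euclidean distance matrix as a list of lists."""
    n = len(points)
    return [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]


def knn_adjacency(dist, k):
    """Symmetric 0/1 adjacency: i~j if j is among the k nearest
    neighbors of i, or i is among the k nearest neighbors of j
    (assumed symmetrization rule)."""
    n = len(dist)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        # Stable sort, so ties are broken by index and graphs stay nested in k.
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        for j in order[:k]:
            adj[i][j] = adj[j][i] = 1
    return adj


def laplacian(adj):
    """Combinatorial graph Laplacian L = D - A."""
    n = len(adj)
    return [[sum(adj[i]) if i == j else -adj[i][j] for j in range(n)]
            for i in range(n)]


def knn_filtration_laplacians(points, max_k):
    """One Laplacian per filtration step k = 1, ..., max_k."""
    dist = pairwise_distances(points)
    return [laplacian(knn_adjacency(dist, k)) for k in range(1, max_k + 1)]


def accumulated_laplacian(laplacians, weights):
    """Weighted sum of the per-scale Laplacians (weights are hypothetical)."""
    n = len(laplacians[0])
    return [[sum(w * lap[i][j] for w, lap in zip(weights, laplacians))
             for j in range(n)] for i in range(n)]


# Toy data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
laplacians = knn_filtration_laplacians(points, max_k=3)
combined = accumulated_laplacian(laplacians, weights=[1.0, 0.5, 0.25])
```

In this toy example the two groups stay disconnected for k = 1 and k = 2 and become linked only at k = 3, which is the kind of scale-dependent connectivity a filtration is meant to expose. Because k is an integer bounded by the number of samples, the filtration parameter ranges over a small discrete set rather than a continuous distance threshold.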