de Oliveira Eliezyer Fermino, Garg Pranjal, Hjerling-Leffler Jens, Batista-Brito Renata, Sjulson Lucas
Dominick P. Purpura Department of Neuroscience, Albert Einstein College of Medicine, Bronx, NY.
All India Institute of Medical Sciences, Rishikesh, India.
bioRxiv. 2024 Aug 9:2024.08.08.607264. doi: 10.1101/2024.08.08.607264.
High-dimensional data have become ubiquitous in the biological sciences, and it is often desirable to compare two datasets collected under different experimental conditions to extract low-dimensional patterns enriched in one condition. However, traditional dimensionality reduction techniques cannot accomplish this because they operate on only one dataset. Contrastive principal component analysis (cPCA) has been proposed to address this problem, but it has seen little adoption because it requires tuning a hyperparameter, which yields multiple solutions with no way of knowing which is correct. Moreover, cPCA treats its foreground and background conditions differently, making it ill-suited to comparing two experimental conditions symmetrically. Here we describe the development of generalized contrastive PCA (gcPCA), a flexible hyperparameter-free approach that solves these problems. We first provide analyses explaining why cPCA requires a hyperparameter and how gcPCA avoids this requirement. We then describe an open-source gcPCA toolbox containing Python and MATLAB implementations of several variants of gcPCA tailored for different scenarios. Finally, we demonstrate the utility of gcPCA in analyzing diverse high-dimensional biological data, enabling unsupervised detection of hippocampal replay in neurophysiological recordings and revealing heterogeneity of type II diabetes in single-cell RNA sequencing data. As a fast, robust, and easy-to-use comparison method, gcPCA provides a valuable resource facilitating the analysis of diverse high-dimensional datasets to gain new insights into complex biological phenomena.
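To make the role of the hyperparameter concrete, the following is a minimal NumPy sketch, not the gcPCA toolbox itself. It assumes the usual formulation of cPCA as the eigendecomposition of C_A - alpha * C_B, where C_A and C_B are the covariance matrices of the two conditions. The abstract does not state the gcPCA objective, so the hyperparameter-free variant here is an assumption for illustration: it normalizes the contrast by the summed covariance, maximizing x^T (C_A - C_B) x / x^T (C_A + C_B) x. All function names and the synthetic data are hypothetical.

```python
import numpy as np

def make_demo_data(n=5000, seed=0):
    """Synthetic data (hypothetical): dim 0 is enriched in A, dim 1 in B,
    and dim 2 carries large variance shared by both conditions."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n, 5)) * np.array([3.0, 1.0, 5.0, 1.0, 1.0])
    B = rng.standard_normal((n, 5)) * np.array([1.0, 3.0, 5.0, 1.0, 1.0])
    return A, B

def covariance(X):
    """Sample covariance of a (samples x features) matrix."""
    Xc = X - X.mean(axis=0)
    return Xc.T @ Xc / (X.shape[0] - 1)

def cpca_directions(A, B, alpha):
    """cPCA sketch: eigenvectors of C_A - alpha * C_B, largest eigenvalue first.
    The result depends on the choice of alpha."""
    w, V = np.linalg.eigh(covariance(A) - alpha * covariance(B))
    order = np.argsort(w)[::-1]
    return w[order], V[:, order]

def normalized_contrast_directions(A, B):
    """Assumed hyperparameter-free contrast: maximize
    x^T (C_A - C_B) x / x^T (C_A + C_B) x by whitening with (C_A + C_B)."""
    CA, CB = covariance(A), covariance(B)
    s, U = np.linalg.eigh(CA + CB)
    keep = s > 1e-10 * s.max()            # drop numerically null directions
    W = U[:, keep] / np.sqrt(s[keep])     # W.T @ (CA + CB) @ W = identity
    w, Y = np.linalg.eigh(W.T @ (CA - CB) @ W)
    order = np.argsort(w)[::-1]
    X = W @ Y[:, order]                   # map back to feature space
    X /= np.linalg.norm(X, axis=0)        # unit-length loadings
    return w[order], X                    # values lie in [-1, 1]

if __name__ == "__main__":
    A, B = make_demo_data()
    for alpha in (0.0, 1.0):
        _, V = cpca_directions(A, B, alpha)
        print("cPCA alpha =", alpha, "top loading:", np.round(V[:, 0], 2))
    vals, X = normalized_contrast_directions(A, B)
    print("normalized contrast values:", np.round(vals, 2))
```

In this sketch the leading cPCA direction changes with alpha, whereas the normalized contrast returns a single ordered set of directions: values near +1 indicate patterns enriched in A, values near -1 indicate patterns enriched in B, so the two conditions are treated symmetrically.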