Stanley Jay S, Yang Junchen, Li Ruiqi, Lindenbaum Ofir, Kobak Dmitry, Landa Boris, Kluger Yuval
Program in Applied Mathematics, Yale University, New Haven, CT, USA.
Interdepartmental Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA.
bioRxiv. 2025 Feb 7:2025.02.03.636129. doi: 10.1101/2025.02.03.636129.
Principal component analysis (PCA) is indispensable for processing high-throughput omics datasets, as it can extract meaningful biological variability while minimizing the influence of noise. However, the suitability of PCA is contingent on appropriate normalization and transformation of count data, and accurate selection of the number of principal components; improper choices can result in the loss of biological information or corruption of the signal due to excessive noise. Typical approaches to these challenges rely on heuristics that lack theoretical foundations. In this work, we present Biwhitened PCA (BiPCA), a theoretically grounded framework for rank estimation and data denoising across a wide range of omics modalities. BiPCA overcomes a fundamental difficulty with handling count noise in omics data by adaptively rescaling the rows and columns, a rigorous procedure that standardizes the noise variances across both dimensions. Through simulations and analysis of over 100 datasets spanning seven omics modalities, we demonstrate that BiPCA reliably recovers the data rank and enhances the biological interpretability of count data. In particular, BiPCA enhances marker gene expression, preserves cell neighborhoods, and mitigates batch effects. Our results establish BiPCA as a robust and versatile framework for high-throughput count data analysis.
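The row and column rescaling described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation or the BiPCA package API. It assumes pure Poisson noise (so the counts themselves serve as an estimate of the noise variance), uses Sinkhorn-style alternating scaling to standardize the average noise variance in every row and column, and estimates the rank by counting singular values above the Marchenko-Pastur edge `sqrt(m) + sqrt(n)`. The function names (`sinkhorn_scale`, `biwhiten_poisson`, `estimate_rank`), the Poisson assumption, and the threshold are all assumptions made for illustration; the abstract does not specify them.

```python
import numpy as np


def sinkhorn_scale(V, n_iter=500, tol=1e-10):
    """Find positive u, v so that diag(u) @ V @ diag(v) has row sums n and column sums m.

    V is a nonnegative m x n matrix of (estimated) noise variances.
    """
    m, n = V.shape
    if np.any(V.sum(axis=1) == 0) or np.any(V.sum(axis=0) == 0):
        raise ValueError("V must not contain all-zero rows or columns")
    u = np.ones(m)
    v = np.ones(n)
    for _ in range(n_iter):
        u = n / (V @ v)      # make every row sum equal to n
        v = m / (V.T @ u)    # make every column sum equal to m
        # Column sums are exact after the v update; check the row sums.
        if np.max(np.abs(u * (V @ v) - n)) < tol * n:
            break
    return u, v


def biwhiten_poisson(Y):
    """Rescale rows and columns of a count matrix (assumed Poisson noise).

    For Poisson data the variance equals the mean, so Y is used as an
    unbiased estimate of the entrywise noise variance (an assumption here).
    """
    Y = np.asarray(Y, dtype=float)
    u, v = sinkhorn_scale(Y)
    # Scaling the data by sqrt(u), sqrt(v) scales the variances by u, v.
    Y_tilde = np.sqrt(u)[:, None] * Y * np.sqrt(v)[None, :]
    return Y_tilde, u, v


def estimate_rank(Y_tilde):
    """Count singular values above the Marchenko-Pastur edge sqrt(m) + sqrt(n)."""
    m, n = Y_tilde.shape
    s = np.linalg.svd(Y_tilde, compute_uv=False)
    return int(np.sum(s > np.sqrt(m) + np.sqrt(n)))
```

A typical use under these assumptions would be `Y_tilde, u, v = biwhiten_poisson(counts)` followed by `estimate_rank(Y_tilde)`, where `counts` is a cells-by-genes (or genes-by-cells) count matrix with no empty rows or columns. Real omics data often deviate from pure Poisson noise, so the variance estimate passed to the scaling step would need to be adapted in that case.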