Hoffman Gabriel E, Roussos Panos
Center for Disease Neurogenomics, Department of Psychiatry, Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA.
Center for Precision Medicine and Translational Therapeutics, Mental Illness Research, Education and Clinical Center VISN2, James J. Peters VA Medical Center, Bronx, NY, USA.
bioRxiv. 2025 Sep 4:2025.09.01.673591. doi: 10.1101/2025.09.01.673591.
Statistical methods often assume independence between the samples or features of a dataset. Yet correlation structure is ubiquitous in real data, so these assumptions are often not met in practice. Whitening transformations are widely applied to remove this correlation structure. Existing approaches to whitening are based on standard linear algebra rather than a probabilistic model, and applying them to high-dimensional datasets with n samples and p features is problematic as p approaches or exceeds n. Moreover, the computational time becomes prohibitive because the naive transform is cubic in p. Here we propose a probabilistic model for data whitening and examine its properties from first principles as p increases. We demonstrate the statistical properties of the probabilistic model and derive a remarkably efficient algorithm that is linear, rather than cubic, in the number of features. We examine the out-of-sample performance of the probabilistic whitening model on simulated data, as well as on real gene expression and genotype data. In an application imputing z-statistics for unobserved genetic variants in a genome-wide association study of schizophrenia, the probabilistic whitening transformation, implemented in our open source R package decorrelate, had the lowest mean square error while being up to an order of magnitude faster than other methods.
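To make the complexity claim concrete, the following is a minimal sketch of one generic way to whiten data in time linear in p when n is smaller than p. It is not the decorrelate package's API and not necessarily the probabilistic model described above: it assumes a simple shrinkage covariance, (1 - lam) * S + lam * nu * I, with a hand-picked weight lam and nu set to the average sample variance. The function names fit_shrinkage_whitener and whiten are hypothetical. The point is only that the inverse square root of such a covariance can be applied through a thin SVD without ever forming a p x p matrix.

```python
import numpy as np


def fit_shrinkage_whitener(X, lam=0.1):
    """Fit a low-rank whitening transform from X (n samples x p features).

    Assumed covariance model (illustrative, not taken from the paper):
        Sigma = (1 - lam) * S + lam * nu * I,   with 0 < lam <= 1,
    where S is the sample covariance and nu = trace(S) / p.
    """
    n, p = X.shape
    mu = X.mean(axis=0)
    Xc = X - mu
    # Thin SVD costs O(n^2 p) when n < p, i.e. linear in p.
    _, d, Vt = np.linalg.svd(Xc, full_matrices=False)
    # Centering removes one rank; drop numerically zero singular values.
    keep = d > d.max() * 1e-10
    d, Vt = d[keep], Vt[keep]
    eig = d**2 / (n - 1)   # nonzero eigenvalues of S
    nu = eig.sum() / p     # average variance, trace(S) / p
    # Inverse square roots of the eigenvalues of Sigma:
    a = 1.0 / np.sqrt((1 - lam) * eig + lam * nu)  # inside the span of V
    b = 1.0 / np.sqrt(lam * nu)                    # in the orthogonal complement
    return mu, Vt, a, b


def whiten(Y, mu, Vt, a, b):
    """Apply Sigma^{-1/2} to the rows of Y without forming a p x p matrix.

    Uses Sigma^{-1/2} = b * I + V diag(a - b) V^T.
    """
    Yc = Y - mu
    proj = Yc @ Vt.T                       # coordinates in the span of V
    return b * Yc + (proj * (a - b)) @ Vt  # O(m * k * p) for m rows, k components
```

A naive implementation would eigendecompose the full p x p covariance, which is the cubic cost the abstract refers to. Here the fitted transform can be applied to held-out rows Y, which is the setting in which out-of-sample performance would be assessed.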