Epidemiology and Biostatistics, School of Public Health, Imperial College London, Norfolk Place, London, W2 1PG, UK.
Statistics Section, Department of Mathematics, Imperial College London, South Kensington Campus, London, SW7 2AZ, UK.
BMC Bioinformatics. 2019 Jan 9;20(1):15. doi: 10.1186/s12859-018-2572-9.
Canonical correlation analysis (CCA) is a classic statistical tool for investigating complex multivariate data. Accordingly, it has found diverse applications, ranging from molecular biology and medicine to social science and finance. Intriguingly, despite the importance and pervasiveness of CCA, a probabilistic understanding of it has only recently begun to develop, moving from an algorithmic to a model-based perspective and enabling its application to large-scale settings.
Here, we revisit CCA from the perspective of statistical whitening of random variables and propose a simple yet flexible probabilistic model for CCA in the form of a two-layer latent variable generative model. The advantages of this variant of probabilistic CCA include non-ambiguity of the latent variables, provision for negative canonical correlations, the possibility of non-normal generative variables, and ease of interpretation at all levels of the model. In addition, we show that it lends itself to computationally efficient estimation in high-dimensional settings using regularized inference. We test our approach to CCA in simulations and apply it to two omics data sets, illustrating the integration of gene expression data, lipid concentrations and methylation levels.
Our whitening approach to CCA provides a unifying perspective on CCA, linking together sphering procedures, multivariate regression and corresponding probabilistic generative models. Furthermore, we offer an efficient computer implementation in the "whitening" R package available at https://CRAN.R-project.org/package=whitening.
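The link between whitening (sphering) and CCA mentioned above can be illustrated with a small numerical sketch. This is not the implementation in the "whitening" R package. It is a minimal Python/NumPy example built on several assumptions: it uses symmetric inverse-square-root (ZCA-type) whitening of the covariance matrices, uses plain empirical covariances without the regularization needed in high dimensions, and returns canonical correlations as non-negative singular values, so it does not cover negative canonical correlations. The names `inv_sqrt` and `whitening_cca` are hypothetical.

```python
import numpy as np


def inv_sqrt(S):
    """Symmetric inverse square root of a positive definite matrix S."""
    vals, vecs = np.linalg.eigh(S)
    # vecs @ diag(1 / sqrt(vals)) @ vecs.T
    return (vecs / np.sqrt(vals)) @ vecs.T


def whitening_cca(X, Y):
    """Classical CCA computed by whitening X and Y separately.

    X: (n, p) array, Y: (n, q) array, rows are paired observations.
    Returns the canonical correlations and the canonical variates.
    """
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)

    # Empirical covariance and cross-covariance matrices (no shrinkage).
    Sxx = Xc.T @ Xc / (n - 1)
    Syy = Yc.T @ Yc / (n - 1)
    Sxy = Xc.T @ Yc / (n - 1)

    # Whitening matrices: each block has identity covariance afterwards.
    Wx = inv_sqrt(Sxx)
    Wy = inv_sqrt(Syy)

    # Cross-covariance between the two whitened blocks.
    K = Wx @ Sxy @ Wy

    # Its SVD yields the rotations that align the whitened variables;
    # the singular values are the canonical correlations.
    U, lam, Vt = np.linalg.svd(K, full_matrices=False)

    x_scores = Xc @ Wx @ U
    y_scores = Yc @ Wy @ Vt.T
    return lam, x_scores, y_scores


# Simulated example: two blocks sharing one latent signal z.
rng = np.random.default_rng(0)
n = 500
z = rng.normal(size=(n, 1))
X = z @ rng.normal(size=(1, 3)) + rng.normal(size=(n, 3))
Y = z @ rng.normal(size=(1, 2)) + rng.normal(size=(n, 2))

lam, x_scores, y_scores = whitening_cca(X, Y)
print("canonical correlations:", lam)
```

In this sketch the canonical variates within each block are uncorrelated with unit variance, and the cross-covariance between the two sets of variates is diagonal with the canonical correlations on the diagonal. That is the sense in which CCA can be read as whitening followed by a rotation.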