Yoon Grace, Müller Christian L, Gaynanova Irina
Department of Statistics, Texas A&M University, College Station, TX.
Center for Computational Mathematics, Flatiron Institute, New York, NY; Department of Statistics, LMU München, Munich, Germany; Institute of Computational Biology, Helmholtz Zentrum München, Germany.
J Comput Graph Stat. 2021;30(4):1249-1256. doi: 10.1080/10618600.2021.1882468. Epub 2021 Mar 29.
Latent Gaussian copula models provide a powerful means to perform multi-view data integration, since these models can seamlessly express dependencies between mixed variable types (binary, continuous, zero-inflated) via latent correlations. The estimation of these latent correlations, however, comes at considerable computational cost, which has prevented the routine use of these models on high-dimensional data. Here, we propose a new computational approach for estimating latent correlations via a hybrid multilinear interpolation and optimization scheme. Our approach speeds up the current state-of-the-art computation by several orders of magnitude, thus allowing fast computation of latent Gaussian copula models even when the number of variables is large. We provide theoretical guarantees for the approximation error of our numerical scheme and demonstrate its excellent performance on simulated and real-world data. We illustrate the practical advantages of our method on high-dimensional sparse quantitative and relative abundance microbiome data as well as multi-view data from The Cancer Genome Atlas Project. Our method is implemented in the R package mixedCCA, available at https://github.com/irinagain/mixedCCA.
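The hybrid scheme described in the abstract can be illustrated with a small, self-contained sketch. It is not the mixedCCA implementation. It assumes the binary/continuous bridge function commonly used in the latent Gaussian copula literature, written here as a one-dimensional integral, and it assumes one plausible reading of "hybrid": a precomputed grid of exact inversions is interpolated bilinearly in the interior, and exact root-finding is used near the boundary. The grid sizes, the 0.9 cutoff, the rescaling of Kendall's tau by its maximum, and all function names are hypothetical choices made for illustration.

```python
import math
from bisect import bisect_right


def bridge(r, delta, n=100):
    """Kendall's tau implied by latent correlation r (binary/continuous pair).

    Assumed form (not taken from the abstract):
    tau = (2/pi) * int_0^{r/sqrt(2)} exp(-delta^2 / (2(1-t^2))) / sqrt(1-t^2) dt,
    evaluated with composite Simpson's rule on n (even) subintervals.
    """
    upper = r / math.sqrt(2.0)
    h = upper / n  # signed step, so negative r gives negative tau

    def f(t):
        one_minus = 1.0 - t * t
        return math.exp(-delta * delta / (2.0 * one_minus)) / math.sqrt(one_minus)

    total = f(0.0) + f(upper)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(i * h)
    return (2.0 / math.pi) * total * h / 3.0


def invert_exact(tau, delta, iters=40):
    """Slow path: solve bridge(r, delta) = tau for r by bisection on [-1, 1]."""
    lo, hi = -1.0, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if bridge(mid, delta) < tau:  # bridge is increasing in r
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Hypothetical grid: tau is rescaled by its maximum so the domain is a rectangle.
S_GRID = [i / 10.0 - 1.0 for i in range(21)]  # scaled tau in [-1, 1]
D_GRID = [j / 5.0 - 1.0 for j in range(11)]   # thresholds delta in [-1, 1]
TABLE = [[invert_exact(s * bridge(1.0, d), d) for d in D_GRID] for s in S_GRID]


def _cell(grid, x):
    """Return the index of the grid cell containing x and the position inside it."""
    i = min(max(bisect_right(grid, x) - 1, 0), len(grid) - 2)
    return i, (x - grid[i]) / (grid[i + 1] - grid[i])


def invert_interp(tau, delta):
    """Fast path: bilinear interpolation of the precomputed inverse."""
    s = tau / bridge(1.0, delta)
    i, u = _cell(S_GRID, s)
    j, v = _cell(D_GRID, delta)
    return ((1 - u) * (1 - v) * TABLE[i][j] + u * (1 - v) * TABLE[i + 1][j]
            + (1 - u) * v * TABLE[i][j + 1] + u * v * TABLE[i + 1][j + 1])


def invert_hybrid(tau, delta, cutoff=0.9):
    """Interpolate in the interior; fall back to bisection near the boundary."""
    s = tau / bridge(1.0, delta)
    if abs(s) > cutoff or abs(delta) > D_GRID[-1]:
        return invert_exact(tau, delta)
    return invert_interp(tau, delta)
```

Calling `invert_hybrid(0.2, 0.3)` reads the latent correlation off the table instead of running a root search, which is where the speed-up would come from when the inversion is repeated for every pair of variables. The accuracy of the fast path depends on the grid resolution, so a finer grid trades memory for a smaller interpolation error.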