Department of Mathematics, Imperial College London, London SW7 2AZ, UK.
Bioinformatics. 2020 Nov 1;36(17):4616-4625. doi: 10.1093/bioinformatics/btaa530.
Recent technological developments have enabled researchers to collect multiple OMICS datasets for the same individuals. The conventional approach to understanding the relationships between the collected datasets and the complex trait of interest is to analyse each OMIC dataset separately from the rest, or to test for associations between the OMICS datasets. In this work we show that integrating multiple OMICS datasets, instead of analysing them separately, improves both our understanding of the relationships between them and the predictive accuracy for the tested trait. Several approaches have been proposed for integrating heterogeneous and high-dimensional (p≫n) data such as OMICS. Sparse canonical correlation analysis (sCCA) is a promising one: it penalizes the canonical vectors to produce sparse latent variables while maximizing the correlation between the datasets. In recent years a number of approaches for implementing sCCA have been proposed; they differ in their objective functions, in the iterative algorithms used to obtain the sparse latent variables and in the assumptions they make about the original datasets.
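To make the idea of penalizing for sparsity concrete, one widely used formulation is the penalized matrix decomposition form associated with Witten and Tibshirani. It is written here in notation chosen for illustration and is not taken from the paper itself. Assume two column-standardized datasets X_1 (n × p_1) and X_2 (n × p_2) measured on the same n individuals, weight vectors w_1 and w_2, and user-chosen sparsity bounds c_1 and c_2. This form also assumes that the within-dataset covariance matrices are treated as identity, which is one of the assumptions on which sCCA variants differ.

```latex
\begin{aligned}
\max_{w_1,\, w_2}\quad & w_1^{\top} X_1^{\top} X_2\, w_2 \\
\text{subject to}\quad & \lVert w_1 \rVert_2^2 \le 1, \qquad \lVert w_2 \rVert_2^2 \le 1, \\
                       & \lVert w_1 \rVert_1 \le c_1, \qquad \lVert w_2 \rVert_1 \le c_2 .
\end{aligned}
```

The latent variables are then X_1 w_1 and X_2 w_2. Smaller values of c_1 and c_2 force more entries of the weight vectors to be exactly zero, so that fewer features contribute to each latent variable.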
Through a comparative study we explored the performance of the conventional CCA proposed by Parkhomenko et al., the penalized matrix decomposition CCA proposed by Witten and Tibshirani, and its extension proposed by Suo et al. These methods were modified to allow for different penalty functions. Although sCCA is an unsupervised learning approach for understanding the relationships between datasets, we recast the problem as a supervised learning one and investigated how the computed latent variables can be used for predicting complex traits. The approaches were also extended to handle multiple (more than two) datasets, with the trait included as one of the input datasets. Both strategies improved on conventional predictive models that include one or multiple datasets.
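As a rough illustration of using sparse latent variables for prediction, the following Python sketch fits one pair of sparse weight vectors on simulated training data and then regresses a trait on the resulting latent variables. It is a minimal sketch under stated assumptions and does not reproduce the authors' implementation. The function names, the simulated data and the fractional soft-threshold rule (used in place of a search for an exact L1 bound) are all hypothetical choices made for this example.

```python
import numpy as np


def soft_threshold(a, frac):
    """Shrink entries of `a` towards zero by frac * max|a| (assumed tuning rule)."""
    lam = frac * np.max(np.abs(a))
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)


def unit_norm(a):
    """Scale `a` to unit Euclidean norm; leave an all-zero vector unchanged."""
    norm = np.linalg.norm(a)
    return a / norm if norm > 0 else a


def sparse_cca(X, Y, frac_u=0.3, frac_v=0.3, n_iter=200, tol=1e-8):
    """First pair of sparse weight vectors via alternating thresholded updates.

    X and Y are assumed to be column-standardized with matching rows.
    """
    C = X.T @ Y  # cross-product matrix, shape (p, q)
    # Initialise v with the leading right singular vector of C.
    v = np.linalg.svd(C, full_matrices=False)[2][0]
    u = unit_norm(soft_threshold(C @ v, frac_u))
    for _ in range(n_iter):
        v_new = unit_norm(soft_threshold(C.T @ u, frac_v))
        u_new = unit_norm(soft_threshold(C @ v_new, frac_u))
        change = np.linalg.norm(u_new - u) + np.linalg.norm(v_new - v)
        u, v = u_new, v_new
        if change < tol:
            break
    return u, v


def run_demo(seed=0):
    """Simulate two datasets sharing one latent factor and predict a trait."""
    rng = np.random.default_rng(seed)
    n, p1, p2, n_train = 150, 300, 200, 100
    z = rng.standard_normal(n)             # shared latent factor
    X1 = rng.standard_normal((n, p1))
    X2 = rng.standard_normal((n, p2))
    X1[:, :10] += 2.0 * z[:, None]         # first 10 features carry signal
    X2[:, :10] += 2.0 * z[:, None]
    y = z + 0.5 * rng.standard_normal(n)   # trait driven by the same factor

    tr, te = slice(0, n_train), slice(n_train, n)
    # Standardize using training-set statistics only.
    X1s = (X1 - X1[tr].mean(0)) / X1[tr].std(0)
    X2s = (X2 - X2[tr].mean(0)) / X2[tr].std(0)

    u, v = sparse_cca(X1s[tr], X2s[tr])

    def latent(idx):
        # Design matrix: intercept plus one latent variable per dataset.
        a, b = X1s[idx] @ u, X2s[idx] @ v
        return np.column_stack([np.ones(a.shape[0]), a, b])

    beta, *_ = np.linalg.lstsq(latent(tr), y[tr], rcond=None)
    y_hat = latent(te) @ beta
    test_corr = np.corrcoef(y_hat, y[te])[0, 1]
    return u, v, test_corr


u, v, test_corr = run_demo()
print("non-zero weights:", np.count_nonzero(u), np.count_nonzero(v))
print("test-set correlation:", round(float(test_corr), 3))
```

The sketch covers only the first strategy described above, in which latent variables are computed without the trait and then used as predictors. Including the trait as an additional input dataset would require a multi-dataset objective, which is not shown here.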
https://github.com/theorod93/sCCA.
Supplementary data are available at Bioinformatics online.