School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology, Gwangju, South Korea.
Artificial Intelligence Graduate School, Gwangju Institute of Science and Technology, Gwangju, South Korea.
J Comput Biol. 2022 Aug;29(8):892-907. doi: 10.1089/cmb.2021.0598.
Integration of multi-omics data provides opportunities for revealing biological mechanisms related to certain phenotypes. We propose a novel method of multi-omics integration called supervised deep generalized canonical correlation analysis (SDGCCA), which models correlation structures between nonlinear multi-omics manifolds with the aim of improving the classification of phenotypes and revealing the biomarkers related to them. SDGCCA addresses the limitations of other canonical correlation analysis (CCA)-based models (such as deep CCA and deep generalized CCA) by considering complex/nonlinear cross-data correlations between multiple (≥2) modalities. Although a few methods learn nonlinear CCA projections for classifying phenotypes, they consider only two views. Methods extended to multiple views either do not perform classification or do not provide feature ranking. In contrast, SDGCCA is a nonlinear multi-view CCA projection method that both performs classification and ranks features. When applied to predicting patients with Alzheimer's disease (AD) and to discriminating between early- and late-stage cancers, SDGCCA outperformed other CCA-based and other supervised methods. In addition, we demonstrate that SDGCCA can be applied for feature selection to identify important multi-omics biomarkers. When applied to AD data, SDGCCA identified clusters of genes in multi-omics data that are well known to be associated with AD.
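For orientation, the multi-view correlation idea that SDGCCA builds on can be illustrated with plain linear generalized CCA in its MAXVAR form. The sketch below is not the authors' method: it uses linear projections instead of deep networks, no phenotype labels, and synthetic data. The function name `linear_gcca`, the regularization value `reg`, and the three simulated "omics" views are all assumptions made for illustration.

```python
import numpy as np

def linear_gcca(views, k=2, reg=1e-3):
    """Linear MAXVAR generalized CCA (illustrative sketch only).

    views: list of (n_samples, d_j) arrays, one per modality.
    Returns the shared representation G (n_samples, k) and one
    projection matrix U_j (d_j, k) per view, chosen so that
    X_j @ U_j approximates G in the least-squares sense.
    """
    n = views[0].shape[0]
    centered = [X - X.mean(axis=0) for X in views]
    M = np.zeros((n, n))
    inv_covs = []
    for X in centered:
        # Regularized inverse of the view's scatter matrix.
        C_inv = np.linalg.inv(X.T @ X + reg * np.eye(X.shape[1]))
        inv_covs.append(C_inv)
        # Accumulate the (regularized) projector onto this view's column space.
        M += X @ C_inv @ X.T
    # eigh returns eigenvalues in ascending order; keep the top-k eigenvectors.
    _, eigvecs = np.linalg.eigh(M)
    G = eigvecs[:, ::-1][:, :k]
    U = [C_inv @ X.T @ G for C_inv, X in zip(inv_covs, centered)]
    return G, U

# Synthetic example: three views driven by one hypothetical latent signal z.
rng = np.random.default_rng(0)
n = 200
z = rng.standard_normal((n, 1))
views = [z @ rng.standard_normal((1, d)) + 0.1 * rng.standard_normal((n, d))
         for d in (5, 8, 6)]
G, U = linear_gcca(views, k=2)
```

Here the first column of `G` should track the shared latent signal, because it is the direction that all three views can reconstruct. According to the abstract, SDGCCA extends this kind of multi-view projection with nonlinear mappings, classification, and feature ranking; none of those parts are reproduced in this sketch.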