Department of Computer Science and Engineering, University at Buffalo, 338 Davis Hall, Buffalo, 14260, NY, USA.
Department of Computer Science, University of Virginia, 509 Rice Hall, Charlottesville, 22904, VA, USA.
BMC Genomics. 2019 Dec 20;20(Suppl 11):944. doi: 10.1186/s12864-019-6285-x.
Comprehensive molecular profiling of various cancers and other diseases has generated vast amounts of multi-omics data. Each type of -omics data corresponds to one feature space, such as gene expression, miRNA expression, or DNA methylation. Integrating multi-omics data can link different layers of molecular feature spaces and is crucial for elucidating the molecular pathways underlying various diseases. Machine learning approaches to mining multi-omics data hold great promise for uncovering intricate relationships among molecular features. However, because of the "big p, small n" problem (i.e., small sample sizes with high-dimensional features), training a large-scale, generalizable deep learning model on multi-omics data alone is very challenging.
We developed a method called Multi-view Factorization AutoEncoder (MAE) with network constraints that can seamlessly integrate multi-omics data and domain knowledge such as molecular interaction networks. Our method learns feature embeddings and patient embeddings simultaneously through deep representation learning. Both the feature representations and the patient representations are subject to constraints specified as regularization terms in the training objective. By incorporating domain knowledge into the training objective, we implicitly introduce a useful inductive bias into the machine learning model, which helps improve model generalizability. We performed extensive experiments on the TCGA datasets and demonstrated the power of integrating multi-omics data and biological interaction networks with our proposed method for predicting target clinical variables.
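To make the idea of a network constraint expressed as a regularization term concrete, the following is a minimal sketch, not the formulation used in the paper. It assumes a graph-Laplacian smoothness penalty, a common way to encode an interaction network: features linked in the network are encouraged to have similar embeddings. The function names `laplacian_penalty` and `regularized_loss`, the edge-list format, and the default `weight` are all hypothetical choices made for illustration.

```python
from typing import List, Tuple

# Hypothetical edge format: (feature index i, feature index j, interaction weight).
Edge = Tuple[int, int, float]


def laplacian_penalty(embeddings: List[List[float]], edges: List[Edge]) -> float:
    """Sum of w_ij * ||z_i - z_j||^2 over the undirected edges of a network.

    When each undirected edge is listed once, this equals trace(Z^T L Z),
    where L = D - W is the graph Laplacian of the network.
    """
    total = 0.0
    for i, j, w in edges:
        zi, zj = embeddings[i], embeddings[j]
        total += w * sum((a - b) ** 2 for a, b in zip(zi, zj))
    return total


def regularized_loss(
    reconstruction_error: float,
    embeddings: List[List[float]],
    edges: List[Edge],
    weight: float = 0.1,  # assumed trade-off hyperparameter
) -> float:
    """Add the weighted network penalty to a base reconstruction error."""
    return reconstruction_error + weight * laplacian_penalty(embeddings, edges)


# Toy example: three features with 2-dimensional embeddings.
toy_embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
toy_edges = [(0, 1, 1.0), (1, 2, 0.5)]
toy_penalty = laplacian_penalty(toy_embeddings, toy_edges)
toy_loss = regularized_loss(2.0, toy_embeddings, toy_edges)
```

In the toy example, features 0 and 1 share an embedding and contribute nothing to the penalty, while the linked but dissimilar features 1 and 2 do. The same form of penalty could be applied to patient embeddings with a patient similarity network, under the same assumptions.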
To alleviate overfitting in deep learning on multi-omics data affected by the "big p, small n" problem, it is helpful to incorporate biological domain knowledge into the model as inductive biases. Designing machine learning models that seamlessly integrate large-scale multi-omics data and biomedical domain knowledge is a very promising way to uncover intricate relationships among molecular features and clinical features.