Unsupervised multiple kernel learning for heterogeneous data integration.

Author information

MIAT, Université de Toulouse, INRA, 31326 Castanet-Tolosan, France.

Publication information

Bioinformatics. 2018 Mar 15;34(6):1009-1015. doi: 10.1093/bioinformatics/btx682.

DOI:10.1093/bioinformatics/btx682

PMID:29077792

Abstract

MOTIVATION

Recent advances in high-throughput sequencing have expanded the breadth of available omics datasets, and the integrated analysis of multiple datasets obtained on the same samples has yielded important insights in a wide range of applications. However, integrating various sources of information remains a challenge for systems biology: the datasets produced are often of heterogeneous types, which calls for generic methods that take their different specificities into account.

RESULTS

We propose a multiple kernel framework that integrates multiple datasets of various types into a single exploratory analysis. Several solutions are provided to learn either a consensus meta-kernel or a meta-kernel that preserves the original topology of the datasets. We applied our framework to analyse two public multi-omics datasets. First, the multiple metagenomic datasets collected during the TARA Oceans expedition were explored to demonstrate that our method is able to retrieve previous findings in a single kernel PCA, as well as to provide a new picture of the sample structure when a larger number of datasets is included in the analysis. To support this analysis, a generic procedure is also proposed to improve the interpretability of the kernel PCA with respect to the original data. Second, the multi-omics breast cancer datasets provided by The Cancer Genome Atlas are analysed using a kernel self-organizing map with both single-omics and multi-omics strategies. The comparison of these two approaches demonstrates the benefit of our integration method in improving the representation of the studied biological system.
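The general idea of combining one kernel per dataset into a meta-kernel and then running a kernel PCA on it can be sketched as follows. This is only a minimal illustration in Python with NumPy, not the mixKernel implementation: the uniform weights are a naive stand-in for the consensus and topology-preserving weightings that the paper learns, the data are random, and every function name here (`linear_kernel`, `cosine_normalize`, `combine_kernels`, `kernel_pca`) is hypothetical.

```python
import numpy as np


def linear_kernel(X):
    """Gram matrix of a linear kernel between the rows (samples) of X."""
    return X @ X.T


def cosine_normalize(K):
    """Rescale a kernel so that every sample has unit self-similarity."""
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)


def combine_kernels(kernels, weights=None):
    """Weighted sum of kernels. ASSUMPTION: uniform weights by default,
    a naive placeholder for the weights learned in the paper."""
    if weights is None:
        weights = np.full(len(kernels), 1.0 / len(kernels))
    return sum(w * K for w, K in zip(weights, kernels))


def kernel_pca(K, n_components=2):
    """Sample coordinates on the leading kernel principal components."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    Kc = H @ K @ H                        # kernel centered in feature space
    eigvals, eigvecs = np.linalg.eigh(Kc)
    order = np.argsort(eigvals)[::-1][:n_components]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Clip tiny negative eigenvalues caused by floating-point error.
    return eigvecs * np.sqrt(np.clip(eigvals, 0.0, None))


# Two synthetic "omics" blocks measured on the same 20 samples.
rng = np.random.default_rng(0)
n_samples = 20
X1 = rng.normal(size=(n_samples, 50))
X2 = rng.normal(size=(n_samples, 8))

kernels = [cosine_normalize(linear_kernel(X)) for X in (X1, X2)]
meta_kernel = combine_kernels(kernels)
scores = kernel_pca(meta_kernel, n_components=2)
print(scores.shape)
```

Because each dataset enters only through its sample-by-sample kernel, blocks with different numbers or types of variables can be combined as long as they describe the same samples; a kernel suited to each data type would replace the linear kernel used here.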

AVAILABILITY AND IMPLEMENTATION

The proposed methods are available in the R package mixKernel, released on CRAN. It is fully compatible with the mixOmics package, and a tutorial describing the approach can be found on the mixOmics website: http://mixomics.org/mixkernel/.

CONTACT

jerome.mariette@inra.fr or nathalie.villa-vialaneix@inra.fr.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
