Université Paris-Saclay, INRAE, AgroParisTech, GABI, 78350, Jouy-en-Josas, France.
Université Paris-Saclay, AgroParisTech, INRAE, UMR MIA-Paris, 75005, Paris, France.
BMC Bioinformatics. 2021 Aug 4;22(1):392. doi: 10.1186/s12859-021-04303-4.
Integrating data from different sources is a recurring question in computational biology. Much effort has been devoted to the integration of data sets of the same type, typically multiple numerical data tables. However, data types are generally heterogeneous: it is commonplace to gather data in the form of trees, networks or factorial maps, as these representations all have an appealing visual interpretation that helps to study grouping patterns and interactions between entities. The question we aim to answer in this paper is that of the integration of such representations.
To this end, we provide a simple procedure to compare data of various types, in particular trees or networks, that relies essentially on two steps: the first step projects the representations into a common coordinate system; the second step then uses a multi-table integration approach to compare the projected data. We rely on efficient and well-known methodologies for each step: the projection step is achieved by retrieving a distance matrix for each representation and then applying multidimensional scaling to derive a new set of coordinates from all the pairwise distances. The integration step is then achieved by applying a multiple factor analysis to the multiple tables of new coordinates. This procedure provides tools to integrate and compare data available, for instance, as tree or network structures. Our approach is complementary to kernel methods, which are traditionally used to answer the same question.
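The two steps described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `classical_mds` and `mfa`, the toy distance matrices, and the choice of two components are assumptions made for illustration. The sketch uses classical (Torgerson) multidimensional scaling for the projection step and a basic form of multiple factor analysis (each centered table divided by its first singular value, followed by a PCA of the concatenated tables) for the integration step.

```python
import numpy as np


def classical_mds(dist, n_components=2):
    """Classical (Torgerson) MDS: coordinates from a symmetric distance matrix."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    # Double-centered Gram matrix derived from the squared distances.
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    # Keep the largest eigenvalues; clip small negatives due to rounding
    # or to non-Euclidean distances.
    order = np.argsort(eigvals)[::-1][:n_components]
    kept = np.clip(eigvals[order], 0.0, None)
    return eigvecs[:, order] * np.sqrt(kept)


def mfa(tables, n_components=2):
    """Basic MFA: balance the tables, then run a PCA on their concatenation."""
    weighted = []
    for table in tables:
        centered = table - table.mean(axis=0)
        # Dividing by the first singular value gives every table a leading
        # variance of one, so that no single table dominates the analysis.
        first_sv = np.linalg.svd(centered, compute_uv=False)[0]
        weighted.append(centered / first_sv)
    stacked = np.hstack(weighted)
    u, s, _ = np.linalg.svd(stacked, full_matrices=False)
    # Global scores of the entities on the first common axes.
    return u[:, :n_components] * s[:n_components]


# Toy input (hypothetical): two distance matrices over the same six entities,
# standing in for distances extracted from two trees or networks.
rng = np.random.default_rng(0)
points = rng.normal(size=(6, 2))
dist_a = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
dist_b = np.abs(points[:, :1] - points[:, :1].T)

# Step 1: project each representation into its own coordinate table.
coords = [classical_mds(d, n_components=2) for d in (dist_a, dist_b)]
# Step 2: integrate the coordinate tables with MFA.
scores = mfa(coords, n_components=2)
```

In practice the distance matrices would come from the representations themselves, for example cophenetic distances for a tree or shortest-path distances for a network; those choices are assumptions here and are not specified in this abstract.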
Our approach is evaluated on simulated data and used to analyze two real-world data sets: first, we compare several clusterings for different cell types obtained from a single-cell transcriptomics data set in mouse embryos; second, we use our procedure to aggregate a multi-table data set from the TCGA breast cancer database, in order to compare several protein networks inferred for different breast cancer subtypes.