Mayer Claus-Dieter, Lorent Julie, Horgan Graham W
Biomathematics and Statistics Scotland.
Stat Appl Genet Mol Biol. 2011;10:Article 14. doi: 10.2202/1544-6115.1540.
The integration of multiple high-dimensional data sets (omics data) has been a very active but challenging area of bioinformatics research in recent years. Various adaptations of non-standard multivariate statistical tools have been suggested that allow such data sets to be analyzed and visualized simultaneously. However, these methods can typically deal with only two data sets, whereas systems biology experiments often generate larger numbers of high-dimensional data sets. For this reason, we suggest an explorative analysis of the similarity between data sets as an initial analysis step. This analysis is based on the RV coefficient, a matrix correlation that can be interpreted as a generalization of the squared correlation from two single variables to two sets of variables. It has been shown previously, however, that the high dimensionality of the data introduces substantial bias into the RV. We therefore introduce an alternative version, the adjusted RV, which is unbiased in the case of independent data sets. We also show that in many situations, particularly for very high-dimensional data sets, the adjusted RV is a better estimator than previous RV versions in terms of the mean square error and the power of the independence test based on it. We demonstrate the usefulness of the adjusted RV by applying it to a collection of 19 different multivariate data sets from a systems biology experiment. The pairwise RV values between the data sets define a similarity matrix that can be used as input to a hierarchical clustering or a multidimensional scaling. We show that this reveals biologically meaningful subgroups of data sets in our study.
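The workflow described in the abstract (pairwise matrix correlations, then clustering of the resulting similarity matrix) can be illustrated with a minimal Python sketch. It computes the classical, unadjusted RV coefficient only; the abstract does not state the formula for the adjusted RV, so that version is not reproduced here. The function names `rv_coefficient` and `rv_matrix`, the simulated data, the use of 1 - RV as a dissimilarity, and the choice of average linkage are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform


def rv_coefficient(x, y):
    """Classical (unadjusted) RV coefficient between two data sets.

    x and y must describe the same n samples in their rows; the number
    of columns (variables) may differ.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean(axis=0)  # column-centre each data set
    yc = y - y.mean(axis=0)
    sx = xc @ xc.T  # n x n sample configuration matrices
    sy = yc @ yc.T
    # For symmetric matrices, trace(sx @ sy) equals the sum of the
    # elementwise product.
    num = np.sum(sx * sy)
    den = np.sqrt(np.sum(sx * sx) * np.sum(sy * sy))
    return num / den


def rv_matrix(datasets):
    """Symmetric matrix of pairwise RV coefficients (hypothetical helper)."""
    k = len(datasets)
    sim = np.eye(k)
    for i in range(k):
        for j in range(i + 1, k):
            sim[i, j] = sim[j, i] = rv_coefficient(datasets[i], datasets[j])
    return sim


# Simulated example (assumption): two data sets driven by the same latent
# factors and one independent high-dimensional data set, 20 samples each.
rng = np.random.default_rng(0)
n = 20
latent = rng.normal(size=(n, 5))
datasets = [
    latent @ rng.normal(size=(5, 50)) + 0.1 * rng.normal(size=(n, 50)),
    latent @ rng.normal(size=(5, 80)) + 0.1 * rng.normal(size=(n, 80)),
    rng.normal(size=(n, 200)),
]

sim = rv_matrix(datasets)

# Turn similarities into dissimilarities (assumed: 1 - RV) and cluster.
dist = 1.0 - sim
np.fill_diagonal(dist, 0.0)
tree = linkage(squareform(dist, checks=False), method="average")
```

For two single variables the function reduces to the squared Pearson correlation, which matches the interpretation of the RV given in the abstract. Printing `sim` for the independent data set is also a quick way to see the upward bias of the unadjusted RV when the number of variables is much larger than the number of samples.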