Assessing the validity domains of graphical Gaussian models in order to infer relationships among components of complex biological systems.

Author information

Villers Fanny, Schaeffer Brigitte, Bertin Caroline, Huet Sylvie

Affiliation

INRA Jouy-En-Josas.

Publication information

Stat Appl Genet Mol Biol. 2008;7(1):Article 14. doi: 10.2202/1544-6115.1371. Epub 2008 Sep 11.

DOI:10.2202/1544-6115.1371

PMID:18976229

Abstract

The study of interactions among cellular components is an essential first step toward understanding the structure and dynamics of biological networks. Various methods have recently been developed for this purpose. While most of them combine different types of data and a priori knowledge, methods based on graphical Gaussian models can learn the network directly from raw data. To model direct links between variables, they use full-order partial correlations, that is, the partial correlations between two variables given all the remaining ones. Statistical methods have been developed for estimating these links when the number of observations is larger than the number of variables. However, the rapid advance of technologies that measure genome-wide expression simultaneously has led to large-scale datasets in which the number of variables is far larger than the number of observations. To get around this dimensionality problem, different strategies and new statistical methods have been proposed. In this study we focus on recently published statistical methods. All of them rely on the assumption that the number of direct relationships between variables is very small relative to the number of possible relationships, p(p-1)/2, where p is the number of variables. In the biological context, this assumption is not always satisfied over the whole graph. It is therefore essential to know precisely how the methods behave with respect to the characteristics of the studied system before applying them. For this purpose, we evaluated the validity domain of each method on wide-ranging simulated datasets. We then illustrated our results using recently published biological data.
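As an illustration of the full-order partial correlations mentioned in the abstract, the following minimal sketch computes them from the inverse of the sample covariance matrix. It covers only the classical case where the number of observations exceeds the number of variables and is not one of the methods evaluated in the paper. The function name `partial_correlations` and the three-variable chain used as data are assumptions made for this example.

```python
import numpy as np


def partial_correlations(X):
    """Full-order partial correlations from an (n, p) data matrix.

    Hypothetical helper for illustration; it requires n > p so that the
    sample covariance matrix can be inverted.
    """
    n, p = X.shape
    if n <= p:
        raise ValueError("need more observations than variables (n > p)")
    S = np.cov(X, rowvar=False)      # p x p sample covariance
    K = np.linalg.inv(S)             # precision (concentration) matrix
    d = np.sqrt(np.diag(K))
    R = -K / np.outer(d, d)          # rho_ij = -k_ij / sqrt(k_ii * k_jj)
    np.fill_diagonal(R, 1.0)
    return R


# Simulated chain x0 -> x1 -> x2 (assumed toy data): x0 and x2 are
# correlated, but they have no direct link once x1 is given.
rng = np.random.default_rng(0)
n = 5000
x0 = rng.normal(size=n)
x1 = x0 + rng.normal(size=n)
x2 = x1 + rng.normal(size=n)
X = np.column_stack([x0, x1, x2])

R = partial_correlations(X)
p = X.shape[1]
n_possible_edges = p * (p - 1) // 2  # number of possible relationships

print(np.round(R, 2))
print("possible edges:", n_possible_edges)
```

In this toy chain the estimated partial correlation between `x0` and `x2` should be close to zero while the two direct links stay clearly non-zero. When the number of variables exceeds the number of observations, the sample covariance is singular and this direct inversion fails, which is the situation the abstract describes.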
