Biotechnology Center, TU Dresden, Dresden, Germany.
PLoS One. 2009 Oct 22;4(10):e7492. doi: 10.1371/journal.pone.0007492.
In recent years, gene interaction networks have increasingly been used to assess and interpret biological measurements. Knowing the interaction partners of an unknown protein allows scientists to understand the complex relationships between gene products, helps to reveal unknown biological functions and pathways, and gives a more detailed picture of an organism's complexity. Measuring all protein interactions under all relevant conditions is virtually impossible. Hence, computational methods that integrate different datasets to predict gene interactions are needed. However, when integrating different sources, one has to account for the fact that some of the information may be redundant, which may lead to an overestimation of the true likelihood of an interaction. Our method integrates information derived from three different databases (Bioverse, HiMAP and STRING) to predict human gene interactions. A Bayesian approach was implemented to integrate the different data sources on a common quantitative scale. An important assumption of this Bayesian integration is conditional independence of the input data (features). Our study shows that conditional dependency cannot be ignored when combining gene interaction databases that rely on partially overlapping input data. In addition, we show how the correlation structure between the databases can be detected, and we propose a linear model to correct for this bias. Benchmarking the results against two independent reference data sets shows that the integrated model outperforms the individual datasets. Our method provides an intuitive strategy for weighting the different features while accounting for their conditional dependencies.
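The overestimation caused by redundant sources can be illustrated with a minimal sketch. It assumes each database contributes a likelihood ratio for a candidate interaction and that these are combined on the log-odds scale. The function names, the prior odds, the likelihood ratios and the weights are all hypothetical values chosen for illustration. They are not taken from the paper, and the weighted form is only one plausible reading of a linear correction, not the authors' fitted model.

```python
import math


def naive_posterior_odds(prior_odds, likelihood_ratios):
    """Combine likelihood ratios as if the sources were conditionally independent."""
    log_odds = math.log(prior_odds) + sum(math.log(lr) for lr in likelihood_ratios)
    return math.exp(log_odds)


def weighted_posterior_odds(prior_odds, likelihood_ratios, weights):
    """Log-linear combination: a weight below 1 down-weights a redundant source."""
    log_odds = math.log(prior_odds) + sum(
        w * math.log(lr) for w, lr in zip(weights, likelihood_ratios)
    )
    return math.exp(log_odds)


# Hypothetical numbers: two databases that report the same underlying evidence.
prior_odds = 1 / 1000
likelihood_ratios = [20.0, 20.0]

# The naive product counts the shared evidence twice.
naive = naive_posterior_odds(prior_odds, likelihood_ratios)

# Hypothetical weights of 0.5 each treat the two sources as one piece of evidence.
corrected = weighted_posterior_odds(prior_odds, likelihood_ratios, [0.5, 0.5])

print(f"naive odds:     {naive:.3f}")
print(f"corrected odds: {corrected:.3f}")
```

With fully redundant sources, the naive product inflates the posterior odds by a factor equal to the duplicated likelihood ratio. In practice the weights would have to be estimated from the observed correlation between the databases rather than set by hand.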