Sontag David, Singh Rohit, Berger Bonnie
Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139, USA.
Pac Symp Biocomput. 2007:445-57.
We describe a novel probabilistic approach to estimating errors in two-hybrid (2H) experiments. Such experiments are frequently used to elucidate protein-protein interaction networks in a high-throughput fashion; however, a significant challenge is their relatively high error rate, in particular a high false-positive rate. We describe a comprehensive error model for 2H data that accounts for both random and systematic errors. The latter arise from limitations of the 2H experimental protocol: in theory, the reporting mechanism of a 2H experiment should be activated if and only if the two proteins being tested truly interact; in practice, even in the absence of a true interaction, it may be activated by some proteins, either by themselves or through promiscuous interaction with other proteins. We describe a probabilistic relational model that explicitly models this phenomenon and use Markov chain Monte Carlo (MCMC) algorithms to compute both the probability that an observed 2H interaction is true and the probability that an individual protein is self-activating or promiscuous. This is the first approach that explicitly models systematic errors in protein-protein interaction data; in contrast, previous work on this topic has modeled errors as independent and random. By explicitly modeling the sources of noise in 2H systems, we find that we are better able to make use of the available experimental data. In comparison with Bader et al.'s method for estimating confidence in 2H predicted interactions, the proposed method performed 5-10% better overall, and in particular regimes improved prediction accuracy by as much as 76%.
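To make the idea of separating systematic from random error concrete, the following is a minimal toy sketch; it is not the probabilistic relational model of the paper. It assumes a much simpler generative story: each bait protein is self-activating with prior `p_self`, each bait-prey pair truly interacts with prior `p_true`, and the reporter fires with probability `1 - fn` if the bait is self-activating or the pair interacts, and with probability `fp` otherwise. The function name `gibbs_two_hybrid` and all parameter values are hypothetical, chosen only for illustration, and promiscuous interactions are not modeled. A Gibbs sampler (one kind of MCMC algorithm) then estimates the posterior probability of each latent variable.

```python
import random


def gibbs_two_hybrid(obs, p_true=0.05, p_self=0.1, fp=0.02, fn=0.3,
                     sweeps=5000, burn_in=500, seed=0):
    """Toy Gibbs sampler; obs maps (bait, prey) -> 0/1 reporter readout.

    All priors and error rates are illustrative assumptions.
    Returns (posterior of self-activation per bait,
             posterior of true interaction per pair).
    """
    rng = random.Random(seed)
    baits = sorted({bait for bait, _ in obs})
    pairs_by_bait = {b: [k for k in obs if k[0] == b] for b in baits}

    # Latent state: is the bait self-activating? does the pair truly interact?
    self_act = {b: 0 for b in baits}
    true_int = {k: 0 for k in obs}
    self_hits = {b: 0 for b in baits}
    true_hits = {k: 0 for k in obs}

    def lik(o, active):
        # Probability of readout o given whether the reporter "should" fire.
        p_on = (1.0 - fn) if active else fp
        return p_on if o else 1.0 - p_on

    for sweep in range(sweeps):
        for b in baits:
            # Resample self-activation given the current interaction states.
            w1, w0 = p_self, 1.0 - p_self
            for k in pairs_by_bait[b]:
                w1 *= lik(obs[k], True)
                w0 *= lik(obs[k], bool(true_int[k]))
            self_act[b] = int(rng.random() < w1 / (w1 + w0))

            # Resample each interaction given the self-activation state.
            for k in pairs_by_bait[b]:
                if self_act[b]:
                    # The readout carries no information about this pair.
                    p1 = p_true
                else:
                    a = p_true * lik(obs[k], True)
                    c = (1.0 - p_true) * lik(obs[k], False)
                    p1 = a / (a + c)
                true_int[k] = int(rng.random() < p1)

        if sweep >= burn_in:
            for b in baits:
                self_hits[b] += self_act[b]
            for k in obs:
                true_hits[k] += true_int[k]

    n = sweeps - burn_in
    return ({b: self_hits[b] / n for b in baits},
            {k: true_hits[k] / n for k in obs})


# Hypothetical data: bait "A" lights up with 8 of 10 preys, bait "B" with 1 of 10.
obs = {("A", j): int(j < 8) for j in range(10)}
obs.update({("B", j): int(j == 0) for j in range(10)})
post_self, post_true = gibbs_two_hybrid(obs)
print(post_self)
print(post_true[("A", 0)], post_true[("B", 0)])
```

Under these assumed settings, the sampler attributes bait "A"'s many positives to self-activation and so assigns each of them a low probability of being a true interaction, while the single positive for bait "B" keeps a much higher probability. A model that treats errors as independent would score all positives identically.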