Kim Inyoung, Liu Yin, Zhao Hongyu
Department of Epidemiology and Public Health, Yale University School of Medicine, New Haven, Connecticut 06520, USA.
Biometrics. 2007 Sep;63(3):824-33. doi: 10.1111/j.1541-0420.2007.00755.x.
Protein-protein interactions (PPIs) play important roles in most fundamental cellular processes including cell cycle, metabolism, and cell proliferation. Therefore, the development of effective statistical approaches to predicting protein interactions based on recently available large-scale experimental data is very important. Because protein domains are the functional units of proteins and PPIs are mostly achieved through domain-domain interactions (DDIs), the modeling and analysis of protein interactions at the domain level may be more informative and insightful. However, due to the large number of domains, the number of parameters to be estimated is very large, yet the amount of information for statistical inference is quite limited. In this article we propose a full Bayesian method and a semi-Bayesian method for simultaneously estimating DDI probabilities, the false positive rate, and the false negative rate of high-throughput data through integrating data from several organisms. We also propose a model to associate protein interaction probabilities with domain interaction probabilities that reflects the number of domains in each protein. Our Bayesian methods are compared with the likelihood-based approach (Deng et al., 2002, Genome Research 12, 1504-1508; Liu, Liu, and Zhao, 2005, Bioinformatics 21, 3279-3285) developed using the expectation maximization algorithm. We show that the full Bayesian method has the smallest mean square error through both simulations and theoretical justification under a special scenario. The large-scale PPI data obtained from high-throughput yeast two-hybrid experiments are used to demonstrate the advantages of the Bayesian approaches.
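As a rough illustration of the domain-level modeling described in the abstract, the sketch below uses the independence assumption commonly associated with the likelihood-based approach of Deng et al. (2002): two proteins interact if at least one of their domain pairs interacts, and the observed data are then distorted by false positive and false negative rates. It is not the domain-count-adjusted model or the Bayesian estimation procedure proposed in this article, and the function names, dictionary layout, and toy numbers are hypothetical.

```python
from itertools import product


def protein_interaction_prob(domains_i, domains_j, ddi_prob):
    """Probability that two proteins truly interact.

    Assumes domain pairs interact independently, so
    P(interact) = 1 - prod over pairs (m, n) of (1 - lambda_mn).
    `ddi_prob` maps an unordered domain pair (a frozenset) to lambda_mn;
    pairs missing from the map are treated as lambda_mn = 0 (an assumption).
    """
    no_interaction = 1.0
    for m, n in product(domains_i, domains_j):
        lam = ddi_prob.get(frozenset((m, n)), 0.0)
        no_interaction *= 1.0 - lam
    return 1.0 - no_interaction


def observed_interaction_prob(p_true, fp, fn):
    """Probability that an experiment reports an interaction.

    A true interaction is detected with probability 1 - fn; a
    non-interaction is reported with probability fp.
    """
    return p_true * (1.0 - fn) + (1.0 - p_true) * fp


# Toy example: protein 1 has domain A, protein 2 has domains B and C.
toy_ddi = {frozenset(("A", "B")): 0.5, frozenset(("A", "C")): 0.2}
p_true = protein_interaction_prob(["A"], ["B", "C"], toy_ddi)
p_obs = observed_interaction_prob(p_true, fp=0.1, fn=0.5)
```

In a full analysis the DDI probabilities and the two error rates would be unknown and estimated from the data, whether by expectation maximization or by the Bayesian methods the abstract describes; here they are fixed by hand only to show how the quantities relate.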