Lee Hyunju, Deng Minghua, Sun Fengzhu, Chen Ting
Department of Computer Science, University of Southern California, Los Angeles, CA 90089, USA.
BMC Bioinformatics. 2006 May 25;7:269. doi: 10.1186/1471-2105-7-269.
The development of high-throughput technologies has produced several large-scale protein interaction data sets for multiple species, and significant efforts have been made to analyze these data sets in order to understand protein activities. Because domain interactions are the basic units of protein interactions, it is crucial to understand protein interactions at the level of domains. The availability of many diverse biological data sets provides an opportunity to discover the domain interactions underlying protein interactions by integrating these data sets.
We combine protein interaction data sets from multiple species, molecular sequences, and gene ontology to construct a set of high-confidence domain-domain interactions. First, we propose a new measure, the expected number of interactions for each pair of domains, to score domain interactions based on protein interaction data in one species, and show that its performance is similar to that of the E-value defined by Riley et al. Our new measure is applied to the protein interaction data sets from yeast, worm, fruitfly and humans. Second, information on pairs of domains that coexist in known proteins and on pairs of domains with the same gene ontology function annotations is incorporated to construct a high-confidence set of domain-domain interactions using a Bayesian approach. Finally, we evaluate the set of domain-domain interactions by comparing the predicted domain interactions with those defined in the iPfam database, which were derived from protein structures. The accuracy of the predicted domain interactions is also confirmed by comparison with experimentally obtained domain interactions from H. pylori. As a result, a total of 2,391 high-confidence domain interactions are obtained, and these domain interactions are used to unravel detailed protein and domain interactions in several protein complexes.
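As a rough illustration of how several evidence sources can be combined in a Bayesian manner, the sketch below multiplies prior odds by one likelihood ratio per evidence source. This is only a minimal sketch and not the authors' implementation: it assumes the evidence sources are conditionally independent (a naive Bayes simplification that the abstract does not state), and the function names, evidence labels, and all numeric values are hypothetical.

```python
from math import prod


def posterior_odds(prior_odds, likelihood_ratios):
    """Combine evidence naive-Bayes style: posterior odds = prior odds * product of LRs.

    Assumes the evidence sources are conditionally independent given whether
    the domain pair truly interacts (an assumption made for this sketch).
    """
    return prior_odds * prod(likelihood_ratios)


def posterior_probability(prior_odds, likelihood_ratios):
    """Convert the combined posterior odds into a probability."""
    odds = posterior_odds(prior_odds, likelihood_ratios)
    return odds / (1.0 + odds)


# Hypothetical likelihood ratios for one domain pair; none of these
# values come from the paper.
prior_odds = 0.01
evidence = {
    "ppi_expected_interactions": 20.0,  # score from protein interaction data
    "domain_coexistence": 3.0,          # domains coexist in a known protein
    "shared_go_function": 2.5,          # same gene ontology function annotation
}

probability = posterior_probability(prior_odds, evidence.values())
print(f"posterior probability of interaction: {probability:.3f}")
```

Under these made-up numbers, a domain pair would be kept as high-confidence only if its posterior probability exceeded some chosen threshold; the threshold, like the likelihood ratios, would have to be estimated from reference data.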
Our study shows that integration of multiple biological data sets based on the Bayesian approach provides a reliable framework to predict domain interactions. By integrating multiple data sources, the coverage and accuracy of predicted domain interactions can be significantly increased.