Guo Jie, Wu Xiaomei, Zhang Da-Yong, Lin Kui
MOE Key Laboratory for Biodiversity Science and Ecological Engineering and College of Life Sciences, Beijing Normal University, Beijing 100875, China.
Nucleic Acids Res. 2008 Apr;36(6):2002-11. doi: 10.1093/nar/gkn016. Epub 2008 Feb 14.
High-throughput studies of protein interactions have produced, both experimentally and computationally, what may be the most comprehensive protein-protein interaction datasets for completely sequenced genomes. These datasets provide an opportunity to discover the underlying protein interaction patterns on a proteome scale. Here, we propose an approach for discovering motif pairs at interaction sites (often 3-8 residues long), which are essential for understanding protein function and helpful for the rational design of protein engineering and folding experiments. A gold standard positive (interacting) dataset and a gold standard negative (non-interacting) dataset were mined to infer the interacting motif pairs that are significantly overrepresented in the positive dataset compared with the negative dataset. Four negative datasets assembled by different strategies were evaluated, and the one with the best performance was used as the gold standard negative dataset for further analysis. To assess the effectiveness of our method in detecting potential interacting motif pairs, we compared it with previously developed approaches and found that our method achieved the highest prediction accuracy. In addition, many uncharacterized motif pairs of interest were found to be functional, with supporting experimental evidence in other species. This investigation demonstrates the important effect of a high-quality negative dataset on the performance of such statistical inference.
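The idea of mining motif pairs that are overrepresented in the positive set relative to the negative set can be illustrated with a small sketch. The abstract does not name the statistical test or the motif representation, so everything specific here is an assumption made for illustration: motifs are treated as exact k-mers, overrepresentation is scored with a one-sided Fisher's exact test, and the function names (`kmers`, `motif_pairs`, `count_pairs`, `fisher_upper_tail`, `overrepresented`) are hypothetical. It should not be read as the authors' method.

```python
from itertools import product
from math import comb


def kmers(seq, k):
    """Return the set of distinct length-k substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def motif_pairs(seq_a, seq_b, k):
    """Unordered k-mer pairs with one k-mer taken from each protein."""
    return {tuple(sorted(p)) for p in product(kmers(seq_a, k), kmers(seq_b, k))}


def count_pairs(protein_pairs, k):
    """For each motif pair, count how many protein pairs contain it."""
    counts = {}
    for seq_a, seq_b in protein_pairs:
        for mp in motif_pairs(seq_a, seq_b, k):
            counts[mp] = counts.get(mp, 0) + 1
    return counts


def fisher_upper_tail(a, n_pos, c, n_neg):
    """One-sided Fisher exact p-value, P(X >= a), under the hypergeometric null.

    a: positive protein pairs containing the motif pair (out of n_pos)
    c: negative protein pairs containing the motif pair (out of n_neg)
    """
    hits = a + c
    denom = comb(n_pos + n_neg, hits)
    upper = min(hits, n_pos)
    tail = sum(comb(n_pos, x) * comb(n_neg, hits - x) for x in range(a, upper + 1))
    return tail / denom


def overrepresented(positives, negatives, k=3, alpha=0.05):
    """Motif pairs enriched in positives (assumed test and threshold), by p-value."""
    pos = count_pairs(positives, k)
    neg = count_pairs(negatives, k)
    results = []
    for mp, a in pos.items():
        p = fisher_upper_tail(a, len(positives), neg.get(mp, 0), len(negatives))
        if p <= alpha:
            results.append((mp, p))
    return sorted(results, key=lambda r: r[1])


# Toy sequences, invented purely for illustration.
positives = [("AAAG", "CCCT")] * 5
negatives = [("GGGT", "TTTG")] * 5
for mp, p in overrepresented(positives, negatives):
    print(mp, round(p, 5))
```

A real analysis would also need a multiple-testing correction, since the number of candidate motif pairs is very large, and the quality of the negative set directly shapes the counts in the second column of each contingency table, which is the point the abstract emphasizes.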