Computational and Functional Genomics Group, Centre for DNA Fingerprinting and Diagnostics, Hyderabad, Andhra Pradesh, India.
PLoS One. 2012;7(7):e42057. doi: 10.1371/journal.pone.0042057. Epub 2012 Jul 26.
Recent progress in computational methods for predicting physical and functional protein-protein interactions has provided new insights into the complexity of biological processes. Most of these methods assume that functionally interacting proteins are likely to share an evolutionary history. For the protein pairs of a query genome, this history can be traced by correlating different evolutionary aspects of their homologs across multiple other genomes, known as reference genomes. The methods include phylogenetic profiling, gene neighborhood, and co-occurrence of orthologous protein-coding genes in the same cluster or operon; these are collectively known as genomic context methods. In contrast, the mirrortree method is based on the similarity between the phylogenetic trees of two interacting proteins. Comprehensive performance analyses of these methods have frequently been reported in the literature. However, very few studies examine how the selection of reference genomes affects the detection of meaningful protein interactions.
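As a rough illustration of the two families of methods named above, the following minimal Python sketch scores a protein pair by correlating presence/absence profiles (phylogenetic profiling) and by correlating inter-ortholog distances (a mirrortree-style score). The function names, the use of Pearson correlation for the profiles, and the input layout are assumptions made for illustration; they are not taken from the study, which evaluated several variants of each method.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant vector carries no co-variation signal
    return sxy / sqrt(sxx * syy)


def profile_similarity(profile_a, profile_b):
    """Phylogenetic profiling (illustrative variant).

    Each profile is a list of 1/0 flags, one per reference genome and in
    the same genome order, marking whether a homolog is present.
    """
    return pearson(profile_a, profile_b)


def mirrortree_score(dist_a, dist_b, genomes):
    """mirrortree-style score (illustrative variant).

    dist_a and dist_b map frozenset({g1, g2}) to the evolutionary distance
    between the orthologs of protein A (or B) in genomes g1 and g2.
    Only genomes containing orthologs of both proteins should be passed.
    """
    pairs = [frozenset(p) for p in combinations(genomes, 2)]
    return pearson([dist_a[p] for p in pairs], [dist_b[p] for p in pairs])


if __name__ == "__main__":
    # Hypothetical toy data: six reference genomes, three proteins.
    a = [1, 1, 0, 0, 1, 0]
    b = [1, 1, 0, 0, 1, 0]
    c = [0, 0, 1, 1, 0, 1]
    print(profile_similarity(a, b), profile_similarity(a, c))
```

In both scores the number and choice of reference genomes determine the length and content of the vectors being correlated, which is why reference genome selection can change the predictions.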
We analyzed the performance of four methods and their variants to understand the effect of reference genome selection on prediction efficacy. We used six sets of reference genomes, sampled from 565 bacterial genomes according to the phylogenetic diversity of, and relationships between, the organisms. Using Escherichia coli as the model organism, we compared the performance of the prediction methods against gold standard datasets of interacting proteins reported in the DIP, EcoCyc and KEGG databases.
Higher performance in predicting protein-protein interactions was achievable even with only 100-150 of the 565 bacterial genomes. Inclusion of archaeal genomes in the reference genome set improves performance. We find that, to obtain good performance, it is better to sample a few genomes from related genera of prokaryotes out of the large number of available genomes. Moreover, such sampling makes it possible to select only 50-100 genomes and still obtain comparable prediction accuracy when computational resources are limited.
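The abstract does not spell out the sampling procedure, so the following is only a hypothetical sketch of one way to draw a small reference set by taking a few genomes per genus. The function name `sample_per_genus`, its parameters, and the genome-to-genus mapping are all assumptions introduced for illustration.

```python
import random
from collections import defaultdict


def sample_per_genus(genome_to_genus, per_genus=1, max_total=None, seed=0):
    """Pick up to `per_genus` genomes from every genus (hypothetical scheme).

    genome_to_genus: dict mapping a genome identifier to its genus name.
    max_total: optional cap on the size of the returned reference set.
    """
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible

    # Group genomes by genus.
    by_genus = defaultdict(list)
    for genome, genus in genome_to_genus.items():
        by_genus[genus].append(genome)

    # Draw a few genomes from each genus, visiting genera in a stable order.
    selected = []
    for genus in sorted(by_genus):
        members = sorted(by_genus[genus])
        selected.extend(rng.sample(members, min(per_genus, len(members))))

    # Optionally thin the set down to a fixed budget, e.g. 50-100 genomes.
    if max_total is not None and len(selected) > max_total:
        selected = rng.sample(selected, max_total)
    return sorted(selected)


if __name__ == "__main__":
    # Toy mapping with made-up genome identifiers.
    toy = {
        "genome_01": "Escherichia",
        "genome_02": "Escherichia",
        "genome_03": "Salmonella",
        "genome_04": "Bacillus",
        "genome_05": "Bacillus",
    }
    print(sample_per_genus(toy, per_genus=1))
```

Sampling per genus keeps closely related genomes from dominating the reference set, and the `max_total` cap corresponds to the limited-resource setting mentioned above.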