Patil Ashwini, Nakamura Haruki
Institute for Protein Research, Osaka University, 3-2 Yamadaoka, Suita, Osaka 565-0871, Japan.
BMC Bioinformatics. 2005 Apr 18;6:100. doi: 10.1186/1471-2105-6-100.
Protein-protein interaction data used to create or predict molecular networks are usually obtained from large-scale or high-throughput experiments. These experimental data are liable to contain a large number of spurious interactions. Hence, the interactions need to be validated and the incorrect ones filtered out before the data are used in prediction studies.
In this study, we use a combination of three genomic features (structurally known interacting Pfam domains, Gene Ontology annotations and sequence homology) to assign reliability to the protein-protein interactions in Saccharomyces cerevisiae determined by high-throughput experiments. Using Bayesian network approaches, we show that protein-protein interactions from high-throughput data that are supported by one or more genomic features have a higher likelihood ratio and hence are more likely to be real interactions. Our method has a high sensitivity (90%) and good specificity (63%). We show that 56% of the interactions from high-throughput experiments in Saccharomyces cerevisiae have high reliability. We use the method to estimate the proportion of true interactions in the high-throughput protein-protein interaction data sets of Caenorhabditis elegans, Drosophila melanogaster and Homo sapiens to be 27%, 18% and 68%, respectively. Our results are available for searching and downloading at http://helix.protein.osaka-u.ac.jp/htp/.
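To make the likelihood-ratio idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes each interaction is described by three binary features (interacting Pfam domains, shared Gene Ontology annotation, homologous interaction) and estimates LR(f) = P(f | true interaction) / P(f | non-interaction) from hypothetical gold-standard positive and negative sets. The function name, the pseudocount and the toy data are all illustrative assumptions.

```python
from collections import Counter
from itertools import product


def likelihood_ratios(positives, negatives, pseudocount=1.0):
    """Estimate LR(f) = P(f | positive) / P(f | negative) for every feature pattern f.

    positives / negatives: lists of equal-length tuples of 0/1 feature flags.
    A pseudocount (an assumption of this sketch) avoids division by zero
    for patterns that are absent from one of the gold-standard sets.
    """
    n_features = len(positives[0])
    patterns = list(product((0, 1), repeat=n_features))
    pos_counts = Counter(positives)
    neg_counts = Counter(negatives)
    pos_total = len(positives) + pseudocount * len(patterns)
    neg_total = len(negatives) + pseudocount * len(patterns)
    ratios = {}
    for pattern in patterns:
        p_pos = (pos_counts[pattern] + pseudocount) / pos_total
        p_neg = (neg_counts[pattern] + pseudocount) / neg_total
        ratios[pattern] = p_pos / p_neg
    return ratios


# Made-up toy data; feature order is (Pfam domains, GO annotation, homology).
toy_positives = [(1, 1, 1)] * 4 + [(1, 1, 0)] * 3 + [(0, 1, 0)] * 2 + [(0, 0, 0)]
toy_negatives = [(0, 0, 0)] * 7 + [(0, 1, 0)] * 2 + [(1, 0, 0)]

lr = likelihood_ratios(toy_positives, toy_negatives)
for pattern, value in sorted(lr.items(), key=lambda kv: -kv[1]):
    print(pattern, round(value, 2))
```

In this toy example, interactions supported by all three features receive the largest ratio and unsupported interactions receive a ratio below 1, which mirrors the qualitative claim that feature-supported interactions are more likely to be real.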
A combination of genomic features that include sequence, structure and annotation information is a good predictor of true interactions in large and noisy high-throughput data sets. The method has a very high sensitivity and good specificity and can be used to assign a likelihood ratio, corresponding to the reliability, to each interaction.
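As an illustration of how sensitivity and specificity such as those quoted above can be computed, the sketch below calls an interaction reliable when its likelihood ratio reaches a cutoff. The cutoff value, the scores and the function name are hypothetical and are not taken from the study.

```python
def sensitivity_specificity(positive_scores, negative_scores, cutoff):
    """Return (sensitivity, specificity) when scores >= cutoff are called reliable.

    positive_scores: likelihood ratios of known true interactions.
    negative_scores: likelihood ratios of known non-interactions.
    """
    true_positives = sum(score >= cutoff for score in positive_scores)
    true_negatives = sum(score < cutoff for score in negative_scores)
    sensitivity = true_positives / len(positive_scores)
    specificity = true_negatives / len(negative_scores)
    return sensitivity, specificity


# Made-up likelihood ratios for a toy gold standard.
toy_positive_scores = [5.0, 5.0, 3.0, 1.2, 0.4]
toy_negative_scores = [0.25, 0.25, 0.5, 1.5, 0.3]

sens, spec = sensitivity_specificity(toy_positive_scores, toy_negative_scores, cutoff=1.0)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```

Raising the cutoff generally trades sensitivity for specificity, so the reported pair of values corresponds to one particular choice of threshold.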