Hawkins Troy, Chitale Meghana, Luban Stanislav, Kihara Daisuke
Department of Biological Sciences, College of Science, Purdue University, West Lafayette, Indiana 47907, USA.
Proteins. 2009 Feb 15;74(3):566-82. doi: 10.1002/prot.22172.
Protein function prediction is a central problem in bioinformatics, increasing in importance recently due to the rapid accumulation of biological data awaiting interpretation. Sequence data represents the bulk of this new stock and is the obvious target for consideration as input, as newly sequenced organisms often lack any other type of biological characterization. We have previously introduced PFP (Protein Function Prediction) as our sequence-based predictor of Gene Ontology (GO) functional terms. PFP interprets the results of a PSI-BLAST search by extracting and scoring individual functional attributes, searching a wide range of E-value sequence matches, and utilizing conventional data mining techniques to fill in missing information. We have shown it to be effective in predicting both specific and low-resolution functional attributes when sufficient data is unavailable. Here we describe (1) significant improvements to the PFP infrastructure, including the addition of prediction significance and confidence scores, (2) a thorough benchmark of performance and comparisons to other related prediction methods, and (3) applications of PFP predictions to genome-scale data. We applied PFP predictions to uncharacterized protein sequences from 15 organisms. Among these sequences, 60-90% could be annotated with a GO molecular function term at high confidence (≥80%). We also applied our predictions to the protein-protein interaction network of the malaria parasite (Plasmodium falciparum). High confidence GO biological process predictions (≥90%) from PFP increased the number of fully enriched interactions in this dataset from 23% of interactions to 94%. Our benchmark comparison shows significant performance improvement of PFP relative to GOtcha, InterProScan, and PSI-BLAST predictions. This is consistent with the performance of PFP as the overall best predictor in both the AFP-SIG '05 and CASP7 function (FN) assessments.
PFP is available as a web service at http://dragon.bio.purdue.edu/pfp/.
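The general idea of scoring GO terms across PSI-BLAST hits over a wide E-value range can be illustrated with a minimal sketch. This is not the published PFP scoring function, which the abstract does not specify. The function name `score_go_terms`, the `shift` and `evalue_cutoff` parameters, and the example hits are all hypothetical, and the data mining step that fills in missing annotations is omitted.

```python
import math
from collections import defaultdict


def score_go_terms(hits, evalue_cutoff=100.0, shift=2.0):
    """Rank GO terms by summing an E-value-based weight over sequence hits.

    hits: iterable of (evalue, go_terms) pairs, one per database hit.
    evalue_cutoff and shift are illustrative assumptions, not PFP's values.
    """
    scores = defaultdict(float)
    for evalue, go_terms in hits:
        if evalue > evalue_cutoff:
            continue  # ignore hits beyond the permissive cutoff
        # Clamp the E-value so that a reported 0.0 does not break log10.
        weight = -math.log10(max(evalue, 1e-300)) + shift
        # Weak hits contribute little, but never a negative amount.
        weight = max(0.0, weight)
        for term in set(go_terms):  # count each term once per hit
            scores[term] += weight
    # Highest-scoring terms first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical hits: (E-value, GO terms annotated on the hit sequence).
example_hits = [
    (1e-50, ["GO:0003824"]),
    (1e-3, ["GO:0003824", "GO:0005524"]),
    (10.0, ["GO:0005524"]),
]
ranked = score_go_terms(example_hits)
```

In this sketch a term supported only by weak hits, such as the second term above, still receives a small score, which mirrors the stated motivation for looking beyond conventional E-value thresholds.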