Hawkins Troy, Luban Stanislav, Kihara Daisuke
Department of Biological Sciences, College of Sciences, Purdue University, West Lafayette, Indiana 47907, USA.
Protein Sci. 2006 Jun;15(6):1550-6. doi: 10.1110/ps.062153506. Epub 2006 May 2.
Automated function prediction methods have emerged in response to an exponentially growing flood of new experimental data, whose interpretation is hindered by a shortage of reliable annotations for proteins that lack experimental characterization or significant homologs in current databases. Here we introduce PFP, an automated function prediction server that provides the most probable annotations for a query sequence in each of the three branches of the Gene Ontology (GO): biological process, molecular function, and cellular component. Rather than using precise pattern matching to identify functional motifs in the sequences and structures of these proteins, we designed PFP to increase the coverage of function annotation by lowering the resolution of predictions when a detailed function is not predictable. To do this, we extend a traditional PSI-BLAST search by extracting and scoring annotations (GO terms) individually, including annotations from distantly related sequences, and by applying a novel data mining tool, the Function Association Matrix, to score strongly associated pairs of annotations. We show that PFP can correctly assign function using only weakly similar sequences, with significantly better accuracy and coverage than a standard PSI-BLAST search, a more than fivefold improvement. The most descriptive annotations predicted by PFP (GO depth ≥ 8) can identify a significant subgraph in the GO with >60% accuracy and approximately 100% coverage for our benchmark set. We also provide examples of the superb performance of PFP in an assessment of automated function prediction servers at the Automated Function Prediction Special Interest Group meeting at ISMB 2005 (AFP-SIG '05).
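The scoring idea described in the abstract can be illustrated with a minimal sketch. It is not the PFP implementation; it only assumes that each PSI-BLAST hit contributes a weight of -log10(E-value) plus a constant shift (so that weak hits with E-values above 1 still count), and that a Function Association Matrix can be represented as conditional probabilities between pairs of GO terms. The function name `score_go_terms`, the `shift` parameter, the data layout, and the placeholder term labels are all hypothetical.

```python
import math
from collections import defaultdict


def score_go_terms(hits, fam, shift=2.0):
    """Score GO terms from a list of sequence hits (illustrative sketch only).

    hits:  list of (e_value, [go_terms]) pairs, one per retrieved sequence.
    fam:   dict mapping an observed term to {associated_term: probability},
           a stand-in for a Function Association Matrix (assumed layout).
    shift: constant added to -log10(E) so weak hits keep a positive weight
           (assumed value, chosen so that E-values up to 100 contribute).
    """
    scores = defaultdict(float)
    for e_value, terms in hits:
        # Floor the E-value so that a reported 0.0 does not break the log.
        weight = -math.log10(max(e_value, 1e-200)) + shift
        if weight <= 0:
            continue  # hit is too weak to contribute under this weighting
        for observed in terms:
            # Each annotation is scored individually, not per sequence.
            scores[observed] += weight
            # Strongly associated terms receive a share of the same weight.
            for associated, prob in fam.get(observed, {}).items():
                scores[associated] += weight * prob
    return dict(scores)


# Placeholder term labels, not real GO identifiers.
example_hits = [(1e-3, ["GO:A"]), (10.0, ["GO:B"])]
example_fam = {"GO:B": {"GO:C": 0.5}}
example_scores = score_go_terms(example_hits, example_fam)
ranked = sorted(example_scores, key=example_scores.get, reverse=True)
```

In this toy input the weak hit (E-value 10) still contributes a small score to its own term and, through the association table, to a term that no hit carries directly, which is the kind of coverage gain the abstract describes.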