Shin Hyunjung, Lisewski Andreas Martin, Lichtarge Olivier
Department of Industrial & Information Systems Engineering, Ajou University, San 5, Wonchun-dong, Yeoungtong-gu, 443-749, Suwon, Korea.
Bioinformatics. 2007 Dec 1;23(23):3217-24. doi: 10.1093/bioinformatics/btm511. Epub 2007 Oct 31.
Predicting protein function is a central problem in bioinformatics, and many approaches use partially or fully automated methods based on various combinations of sequence, structure and other information on proteins or genes. Such information establishes relationships between proteins that can be modelled most naturally as edges in graphs. A priori, however, it is often unclear which edges from which graph may contribute most to accurate predictions. For that reason, one established strategy, graph integration, is to combine all available sources (graphs) in the hope that the positive signals will add to each other. However, in the problem of functional prediction, noise, i.e. the presence of inaccurate or false edges, can still be large enough that integration alone has little effect on prediction accuracy. In order to reduce noise levels and to improve integration efficiency, we present here a recent method in graph-based learning, graph sharpening, which provides a theoretically firm yet intuitive and practical approach for disconnecting undesirable edges from protein similarity graphs. This approach has several attractive features: it is quick, scalable in the number of proteins, robust with respect to errors and tolerant of very diverse types of protein similarity measures.
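The pipeline described above (disconnect undesirable edges in each similarity graph, then integrate the graphs and classify) can be illustrated with a minimal sketch. The abstract does not state the actual sharpening rule, the integration weights or the classifier, so everything here is an assumption made for illustration: the pruning rule (drop edges between oppositely labeled proteins and edges that carry influence from unlabeled into labeled proteins), the equal-weight average, the simple iterative label propagation, and all function names and toy values.

```python
from typing import List

Matrix = List[List[float]]  # w[i][j]: influence of protein j on protein i


def sharpen(w: Matrix, labels: List[int]) -> Matrix:
    """Return a copy of w with 'undesirable' edges set to zero.

    Assumed rule (not taken from the abstract): remove an edge if it links
    oppositely labeled nodes, or if it feeds an unlabeled node's value
    into a labeled node. Labels are +1, -1, or 0 for unlabeled.
    """
    n = len(w)
    out = [row[:] for row in w]
    for i in range(n):
        for j in range(n):
            opposite = labels[i] * labels[j] == -1
            into_labeled = labels[i] != 0 and labels[j] == 0
            if opposite or into_labeled:
                out[i][j] = 0.0
    return out


def integrate(graphs: List[Matrix]) -> Matrix:
    """Combine graphs by an equal-weight average (an assumption)."""
    n, k = len(graphs[0]), len(graphs)
    return [[sum(g[i][j] for g in graphs) / k for j in range(n)]
            for i in range(n)]


def propagate(w: Matrix, labels: List[int],
              alpha: float = 0.9, iters: int = 100) -> List[float]:
    """Iterative label propagation on a row-normalized graph."""
    n = len(w)
    f = [float(y) for y in labels]
    for _ in range(iters):
        new_f = []
        for i in range(n):
            total = sum(w[i])
            if total == 0:
                # No incoming edges: the node keeps its own label.
                new_f.append(float(labels[i]))
                continue
            avg = sum(w[i][j] * f[j] for j in range(n)) / total
            new_f.append(alpha * avg + (1 - alpha) * labels[i])
        f = new_f
    return f


# Toy data: proteins 0 (+1) and 1 (-1) are labeled, 2 and 3 are not.
labels = [1, -1, 0, 0]
graph_a = [[0.0, 0.8, 1.0, 0.0],   # the 0-1 edge plays the role of noise
           [0.8, 0.0, 0.0, 1.0],
           [1.0, 0.0, 0.0, 0.0],
           [0.0, 1.0, 0.0, 0.0]]
graph_b = [[0.0, 0.0, 0.5, 0.0],
           [0.0, 0.0, 0.0, 0.5],
           [0.5, 0.0, 0.0, 0.1],
           [0.0, 0.5, 0.1, 0.0]]

sharpened_a = sharpen(graph_a, labels)
sharpened_b = sharpen(graph_b, labels)
combined = integrate([sharpened_a, sharpened_b])
scores = propagate(combined, labels)
```

In this toy example the edge between the two oppositely labeled proteins is cut before integration, so it can no longer pull their scores toward each other, and the unlabeled proteins 2 and 3 end up with scores of the same sign as their labeled neighbors.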
We tested the classification accuracy in a test set of 599 proteins with remote sequence homology spread over 20 Gene Ontology (GO) functional classes. When compared to integration alone, graph sharpening plus integration of four vastly different molecular similarity measures improved the overall classification by nearly 30% [0.17 average increase in the area under the ROC curve (AUC)]. Moreover, and partially through the increased sparsity of the graphs induced by sharpening, this gain in accuracy came at negligible computational cost: sharpening and integration took on average 4.66 (±4.44) CPU seconds.
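For readers unfamiliar with the metric, the AUC reported above can be computed as the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as one half. The sketch below uses that standard pairwise definition; the function name and the toy scores are hypothetical and are not data from the study.

```python
from typing import List


def auc(scores: List[float], labels: List[int]) -> float:
    """Area under the ROC curve via pairwise comparison of scores.

    labels: 1 for members of the functional class, anything else otherwise.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y != 1]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical scores for four proteins, two of them in the class.
demo_scores = [0.9, 0.8, 0.4, 0.3]
demo_labels = [1, 0, 1, 0]
demo_auc = auc(demo_scores, demo_labels)  # 3 of 4 pairs ranked correctly
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is why an average increase of 0.17 is a substantial change.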
Software and Supplementary data will be available on http://mammoth.bcm.tmc.edu/