Lei Zhengdeng, Dai Yang
Department of Bioengineering (MC063), University of Illinois at Chicago, 851 South Morgan Street, Chicago, IL 60607, USA.
BMC Bioinformatics. 2006 Nov 7;7:491. doi: 10.1186/1471-2105-7-491.
The completion of various genome sequencing projects has led to the accumulation of a massive amount of gene sequence information. This calls for large-scale computational methods for predicting protein localization from sequence. A protein's localization can provide valuable information about its molecular function, as well as the biological pathway in which it participates. Predicting the localization of a protein at the subnuclear level is a challenging task. In our previous work, we proposed an SVM-based system that uses protein sequence information for this prediction task. In this work, we assess protein similarity with Gene Ontology (GO) and then improve the performance of the system by adding a nearest neighbor classifier module that uses a similarity measure derived from the GO annotation terms of protein sequences.
The performance of the new system was compared with that of our previous system using a set of proteins residing in 6 localizations, collected from the Nuclear Protein Database (NPD). The overall MCC (accuracy) rose from 0.284 (50.0%) to 0.519 (66.5%) for single-localization proteins in leave-one-out cross-validation, and from 0.420 (65.2%) to 0.541 (65.2%) for an independent set of multi-localization proteins. The new system is available at http://array.bioengr.uic.edu/subnuclear.htm.
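For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how a Matthews correlation coefficient (MCC) and an accuracy can be computed from predicted and true localization labels. It assumes a one-vs-rest MCC per localization; how the per-class values are aggregated into the "overall MCC" is not stated in this abstract, so no aggregation is shown. The function names and the toy labels are hypothetical.

```python
import math
from typing import List, Tuple


def one_vs_rest_counts(y_true: List[str], y_pred: List[str],
                       label: str) -> Tuple[int, int, int, int]:
    """Return (tp, tn, fp, fn), treating `label` as the positive class."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == label and truth == label:
            tp += 1
        elif pred == label:
            fp += 1
        elif truth == label:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Binary Matthews correlation coefficient from confusion counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Common convention: report 0 when the coefficient is undefined.
        return 0.0
    return (tp * tn - fp * fn) / denom


def accuracy(y_true: List[str], y_pred: List[str]) -> float:
    """Fraction of proteins whose predicted label equals the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Toy labels, invented purely for illustration (not data from the study).
toy_true = ["loc_A", "loc_A", "loc_B", "loc_B"]
toy_pred = ["loc_A", "loc_B", "loc_B", "loc_B"]
toy_mcc_a = mcc(*one_vs_rest_counts(toy_true, toy_pred, "loc_A"))
toy_acc = accuracy(toy_true, toy_pred)
```

Because MCC uses all four confusion counts, it can change while accuracy stays the same, which is consistent with the multi-localization result above (MCC rises while accuracy remains 65.2%).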
The prediction of protein subnuclear localization is strongly influenced by how the similarity between a pair of proteins is defined, which in turn depends on the similarity measure chosen for GO terms. Defining the similarity of two proteins as the sum of the similarity scores over their matched GO term pairs produced the best predictive outcome. Combining Gene Ontology with sequence information yields a substantial improvement in predicting protein subnuclear localization.
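The similarity definition described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the abstract does not say how GO term pairs are "matched", so here each term of the query protein is assumed to be matched to its most similar term in the other protein. The term-level similarity is a hypothetical lookup table, and the term identifiers, scores, and localization labels are invented.

```python
from typing import Callable, Dict, FrozenSet, List, Optional, Set, Tuple

TermSim = Callable[[str, str], float]


def make_term_sim(table: Dict[FrozenSet[str], float]) -> TermSim:
    """Build a term-level similarity from a (hypothetical) score table."""
    def term_sim(t1: str, t2: str) -> float:
        if t1 == t2:
            return 1.0  # assumption: identical terms score 1
        return table.get(frozenset((t1, t2)), 0.0)
    return term_sim


def protein_similarity(terms_a: Set[str], terms_b: Set[str],
                       term_sim: TermSim) -> float:
    """Sum of scores over matched term pairs.

    Assumption: each term in `terms_a` is matched to its best-scoring
    term in `terms_b`; the resulting score is not symmetric in general.
    """
    if not terms_a or not terms_b:
        return 0.0
    return sum(max(term_sim(a, b) for b in terms_b) for a in terms_a)


def nearest_neighbor_label(query_terms: Set[str],
                           training: List[Tuple[Set[str], str]],
                           term_sim: TermSim) -> Tuple[Optional[str], float]:
    """Return the label and score of the most similar training protein."""
    best_label: Optional[str] = None
    best_score = float("-inf")
    for terms, label in training:
        score = protein_similarity(query_terms, terms, term_sim)
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score


# Invented terms, scores, and labels, for illustration only.
toy_sim = make_term_sim({
    frozenset(("T4", "T2")): 0.25,
    frozenset(("T4", "T3")): 0.5,
})
toy_training = [({"T1", "T2"}, "loc_A"), ({"T3"}, "loc_B")]
toy_label, toy_score = nearest_neighbor_label({"T1", "T4"}, toy_training, toy_sim)
```

Summing rather than averaging lets proteins that share many related annotations score higher than proteins that share only one. Other definitions, such as the maximum or mean over term pairs, can be swapped into `protein_similarity` for comparison.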