Prediction of functional class of novel bacterial proteins without the use of sequence similarity by a statistical learning method.

Authors

Cui J, Han L Y, Cai C Z, Zheng C J, Ji Z L, Chen Y Z

Affiliation

Bioinformatics and Drug Design Group, Department of Computational Science, National University of Singapore, Singapore.

Publication information

J Mol Microbiol Biotechnol. 2005;9(2):86-100. doi: 10.1159/000088839.

DOI:10.1159/000088839

PMID:16319498

Abstract

A substantial percentage of the putative protein-encoding open reading frames (ORFs) in bacterial genomes have no homolog of known function, so their function cannot be confidently assigned on the basis of sequence similarity. Methods not based on sequence similarity are needed and are being developed. One such method, SVMProt (http://jing.cz3.nus.edu.sg/cgi-bin/svmprot.cgi), predicts protein functional family irrespective of sequence similarity (Nucleic Acids Res. 2003;31:3692-3697). While it has been tested on a large number of proteins, its performance on non-homologous proteins has so far been evaluated on only a relatively small number of proteins, and additional tests are needed to assess SVMProt more fully. In this work, 90 novel bacterial proteins (non-homologous to known proteins) are used to evaluate the capability of SVMProt. None of their homologs are in the Swiss-Prot database, their functions are not clearly described in the literature, and neither the proteins themselves nor their homologs are included in the training sets of SVMProt. They represent proteins whose function cannot at present be confidently predicted by sequence similarity methods. The predicted functional class of 76.7% of these proteins shows various levels of consistency with the literature-described function, compared to the overall accuracy of 87% for the SVMProt functional class assignment of 34,582 proteins that have at least one homolog of known function. Our study suggests that SVMProt is capable of assigning functional class to novel bacterial proteins at a level not much lower than that of sequence alignment methods for homologous proteins.
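To put the two reported figures side by side, the following is a minimal Python sketch of the arithmetic, not part of the original study. It assumes the 76.7% refers to a whole number of the 90 test proteins, from which the count is inferred by rounding; the abstract does not state that count. The Wilson score interval is an added illustration of how much uncertainty a 90-protein sample carries and is not an analysis reported by the authors. The function and constant names are hypothetical.

```python
import math

def implied_count(percent: float, total: int) -> int:
    """Nearest whole number of items that a reported percentage implies."""
    return round(percent / 100 * total)

def wilson_interval(successes: int, total: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    p = successes / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half

# Figures taken from the abstract.
NOVEL_TOTAL = 90        # novel bacterial proteins tested
NOVEL_PERCENT = 76.7    # share with predictions consistent with the literature
HOMOLOG_PERCENT = 87.0  # overall accuracy on 34,582 proteins with known homologs

# Inferred, not reported: number of novel proteins with consistent predictions.
consistent = implied_count(NOVEL_PERCENT, NOVEL_TOTAL)

# Added illustration: sampling uncertainty around the inferred proportion.
low, high = wilson_interval(consistent, NOVEL_TOTAL)

# Gap between the two reported accuracies, in percentage points.
gap = HOMOLOG_PERCENT - NOVEL_PERCENT

print(f"implied consistent predictions: {consistent} of {NOVEL_TOTAL}")
print(f"approx. 95% Wilson interval: {low:.3f} to {high:.3f}")
print(f"gap to homolog-based accuracy: {gap:.1f} percentage points")
```

Because the interval is computed from an inferred count on a small sample, it should be read as a rough indication of scale rather than as a result of the paper.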
