Yu Xiaojing, Cao Jianping, Cai Yudong, Shi Tieliu, Li Yixue
Bioinformatics Center, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Graduate School of the Chinese Academy of Sciences, 320 Yueyang Road, Shanghai 200031, PR China.
J Theor Biol. 2006 May 21;240(2):175-84. doi: 10.1016/j.jtbi.2005.09.018. Epub 2005 Nov 7.
In the post-genome era, predicting protein function is one of the most demanding tasks in bioinformatics. Machine learning methods such as support vector machines (SVMs) greatly help to improve the classification of protein function. In this work, we combined SVMs with protein sequence amino acid composition and associated physicochemical properties to predict nucleic-acid-binding proteins. We developed binary classifiers for rRNA-, RNA-, and DNA-binding proteins, which play an important role in the control of many cell processes. Each SVM predicts whether a protein belongs to the rRNA-, RNA-, or DNA-binding class. Self-consistency and jackknife tests were performed on protein data sets in which sequence identity was below 25%. Test results show that the accuracies of the rRNA-, RNA-, and DNA-binding SVM predictions are approximately 84%, 78%, and 72%, respectively. Predictions were also made on an ambiguous data set and a negative data set. The scores that the RNA- and DNA-binding SVM models assigned to proteins in the ambiguous data set were distributed around zero, while most proteins in the negative data set received negative scores from all three SVMs. These score distributions agree well with prior knowledge of those proteins and show the effectiveness of sequence-associated physicochemical properties in protein function prediction. The software is available from the author upon request.
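The abstract names three ingredients: amino acid composition features, one binary SVM per binding class, and a jackknife (leave-one-out) test. The sketch below illustrates how those pieces could fit together; it is not the authors' software. Several things are assumptions made for illustration: the SVM is a plain linear one trained by subgradient descent on the hinge loss (the abstract does not state the kernel or solver), the physicochemical property features are omitted, the hyperparameters `lam`, `lr`, and `epochs` are arbitrary, and the sequences in `toy_seqs` are invented strings, not real proteins.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition(seq):
    """Return the 20-dimensional amino acid composition (fractions) of seq."""
    counts = Counter(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard residues")
    return [counts[aa] / total for aa in AMINO_ACIDS]


def decision_score(w, b, x):
    """Signed SVM score; positive means 'predicted to belong to the class'."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Fit a linear SVM by subgradient descent on the L2-regularized hinge loss.

    Labels must be +1 or -1. Hyperparameter defaults are illustrative only.
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * decision_score(w, b, xi)
            # Regularization step: shrink the weights toward zero.
            w = [(1.0 - lr * lam) * wj for wj in w]
            # Hinge-loss step: only samples inside the margin push the model.
            if margin < 1.0:
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b


def jackknife_accuracy(X, y, **svm_params):
    """Leave-one-out test: hold out each sample once, train on the rest."""
    correct = 0
    for i in range(len(X)):
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        w, b = train_linear_svm(X_train, y_train, **svm_params)
        predicted = 1 if decision_score(w, b, X[i]) >= 0 else -1
        correct += int(predicted == y[i])
    return correct / len(X)


# Invented toy sequences (NOT real proteins): +1 = binding, -1 = non-binding.
toy_seqs = ["KRKRGSKRAK", "RRKKALKRGR", "KKRSGRKKAR",
            "DEEDLLAEDE", "EEDDGAVEDE", "DEDELVGEED"]
toy_labels = [1, 1, 1, -1, -1, -1]
toy_features = [composition(s) for s in toy_seqs]

print("jackknife accuracy:", jackknife_accuracy(toy_features, toy_labels))
```

To mirror the three classifiers in the abstract, one such model would be trained per class (rRNA-, RNA-, and DNA-binding), each with its own positive and negative labels, and the signed output of `decision_score` would play the role of the predicted score whose distribution is examined on the ambiguous and negative data sets.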