Shamim Mohammad Tabrez Anwar, Anwaruddin Mohammad, Nagarajaram H A
Laboratory of Computational Biology, Centre for DNA Fingerprinting and Diagnostics, Hyderabad 500 076, India.
Bioinformatics. 2007 Dec 15;23(24):3320-7. doi: 10.1093/bioinformatics/btm527. Epub 2007 Nov 7.
Fold recognition is a key step in the protein structure discovery process, especially when traditional sequence comparison methods fail to yield convincing structural homologies. Although many methods have been developed for protein fold recognition, their accuracies remain low. This can be attributed to insufficient exploitation of fold discriminatory features.
We have developed a new method for protein fold recognition that uses structural information of amino acid residues and amino acid residue pairs. Since protein fold recognition can be treated as a protein fold classification problem, we have developed a Support Vector Machine (SVM) based classifier that uses secondary structural state and solvent accessibility state frequencies of amino acids and amino acid pairs as feature vectors. Among the individual properties examined, secondary structural state frequencies of amino acids gave an overall accuracy of 65.2% for fold discrimination, which is better than the accuracy of any method reported so far in the literature. Combining secondary structural state frequencies with solvent accessibility state frequencies of amino acids and amino acid pairs further improved the fold discrimination accuracy to more than 70%, which is approximately 8% higher than the best available method. In this study we have also tested, for the first time, an all-together multi-class method known as the Crammer and Singer method for protein fold classification. Our studies reveal that the three multi-class classification methods, namely one versus all, one versus one and the Crammer and Singer method, yield similar predictions.
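To make the feature construction concrete, the following is a minimal Python sketch of one possible way to compute secondary structural state frequencies of amino acids. It is an illustration under stated assumptions, not the authors' program: the three-state alphabet (H, E, C), the normalization by sequence length, and the names `AMINO_ACIDS`, `SS_STATES` and `state_frequencies` are all hypothetical choices that the abstract does not specify.

```python
from itertools import product

# Assumed alphabets: the 20 standard amino acids and a three-state
# secondary structure code (H = helix, E = strand, C = coil).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
SS_STATES = "HEC"


def state_frequencies(sequence, states, state_alphabet=SS_STATES):
    """Return a fixed-length vector of (amino acid, state) frequencies.

    Each entry is the number of residues of a given amino acid type found
    in a given state, divided by the sequence length (an assumed
    normalization). Residues or states outside the alphabets are ignored.
    """
    if len(sequence) != len(states):
        raise ValueError("sequence and states must have the same length")

    # Fixed key order so every protein maps to the same feature layout.
    keys = [aa + st for aa, st in product(AMINO_ACIDS, state_alphabet)]
    counts = dict.fromkeys(keys, 0)

    for aa, st in zip(sequence, states):
        key = aa + st
        if key in counts:
            counts[key] += 1

    n = len(sequence)
    return [counts[k] / n if n else 0.0 for k in keys]


if __name__ == "__main__":
    # Toy input: four residues with per-residue secondary structure states.
    vector = state_frequencies("ACCA", "HHEC")
    print(len(vector))  # 20 amino acids x 3 states
    print(vector[:6])
```

The same function could be called with a solvent accessibility alphabet (for example a hypothetical two-state buried/exposed code) by passing a different `state_alphabet`, and the resulting vectors concatenated. Vectors of this kind could then be passed to any multi-class SVM implementation; in scikit-learn, for instance, `OneVsRestClassifier`, `OneVsOneClassifier` and `LinearSVC(multi_class="crammer_singer")` correspond to the three multi-class strategies named above, although the abstract does not state which software the authors used.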
The dataset and a stand-alone program are available upon request.