Protein topology classification using two-stage support vector machines.

Author information

Gubbi Jayavardhana, Shilton Alistair, Parker Michael, Palaniswami Marimuthu

Affiliation

Department of Electrical and Electronics Engineering, The University of Melbourne, Parkville, Victoria 3010, Australia.

Publication information

Genome Inform. 2006;17(2):259-69.

PMID:17503398

Abstract

Determining the first 3-D model of a protein from its sequence alone is a non-trivial problem. The first 3-D model is the key to the molecular replacement method of solving the phase problem in X-ray crystallography. If the sequence identity is more than 30%, homology modelling can be used to determine the correct topology (as defined by CATH) or fold (as defined by SCOP). If the sequence identity is less than 25%, however, the task is very challenging. In this paper we address the topology classification of proteins with sequence identity of less than 25%. The inputs to the system are the amino acid sequence, the predicted secondary structure and the predicted real-valued relative solvent accessibility. A two-stage support vector machine (SVM) approach is proposed that classifies sequences into three structural classes (alpha, beta, alpha+beta) in the first stage and into 39 topologies in the second stage. The method is evaluated using a newly curated dataset from CATH with maximum pairwise sequence identity of less than 25%. Overall accuracies of 87.44% and 83.15% are reported for class and topology prediction, respectively. In the class prediction stage, a sensitivity of 0.77 and a specificity of 0.91 are obtained. The data file, SVM implementation (SVMHEAVY) and result files can be downloaded from http://www.ee.unimelb.edu.au/ISSNIP/downloads/.
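
The two-stage idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' SVMHEAVY implementation, and several details are assumptions made only for illustration: that the second stage trains one classifier per structural class, the class names `NearestCentroid` and `TwoStageClassifier`, the toy 2-D feature vectors, and the topology labels `a1`, `a2`, `b1`, `b2`. To keep the example dependency-free, a simple nearest-centroid classifier stands in for the SVMs; any classifier with `fit` and `predict` methods, including an SVM, could be plugged in through `make_classifier`.

```python
from collections import defaultdict
from math import dist


class NearestCentroid:
    """Stand-in classifier (NOT an SVM): predicts the label of the closest class mean."""

    def fit(self, X, y):
        groups = defaultdict(list)
        for features, label in zip(X, y):
            groups[label].append(features)
        # Mean feature vector for each label.
        self.centroids = {
            label: [sum(column) / len(rows) for column in zip(*rows)]
            for label, rows in groups.items()
        }
        return self

    def predict(self, X):
        return [
            min(self.centroids, key=lambda label: dist(x, self.centroids[label]))
            for x in X
        ]


class TwoStageClassifier:
    """Stage 1 predicts the structural class; stage 2 predicts a topology within it."""

    def __init__(self, make_classifier=NearestCentroid):
        # make_classifier: zero-argument factory returning an object with fit/predict.
        self.make_classifier = make_classifier

    def fit(self, X, classes, topologies):
        self.stage1 = self.make_classifier().fit(X, classes)
        # Assumption: one stage-2 model per class, trained only on that class's samples.
        by_class = defaultdict(lambda: ([], []))
        for x, c, t in zip(X, classes, topologies):
            by_class[c][0].append(x)
            by_class[c][1].append(t)
        self.stage2 = {
            c: self.make_classifier().fit(xs, ts) for c, (xs, ts) in by_class.items()
        }
        return self

    def predict(self, X):
        results = []
        for x, c in zip(X, self.stage1.predict(X)):
            # Route each sample to the stage-2 model of its predicted class.
            results.append((c, self.stage2[c].predict([x])[0]))
        return results


# Toy, made-up data: real inputs would be features derived from the sequence,
# predicted secondary structure and predicted solvent accessibility.
X = [[0.9, 0.1], [0.8, 0.0], [0.1, 0.9], [0.0, 0.8]]
classes = ["alpha", "alpha", "beta", "beta"]
topologies = ["a1", "a2", "b1", "b2"]  # hypothetical topology labels

model = TwoStageClassifier().fit(X, classes, topologies)
print(model.predict([[0.88, 0.09], [0.02, 0.79]]))
```

One consequence of this routing design is that an error in the first stage cannot be corrected in the second: a sample assigned to the wrong class is only ever compared against topologies of that class.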
