Prediction of Protein Secondary Structure with two-stage multi-class SVMs.

Author information

Nguyen Minh N, Rajapakse Jagath C

Affiliation

BioInformatics Research Centre, School of Computer Engineering, Nanyang Technological University, Singapore.

Publication information

Int J Data Min Bioinform. 2007;1(3):248-69. doi: 10.1504/ijdmb.2007.011612.

DOI:10.1504/ijdmb.2007.011612

PMID:18399074

Abstract

Bioinformatics techniques for Protein Secondary Structure (PSS) prediction mostly depend on the information available in amino acid sequences. In this paper, we propose a two-stage Multi-class Support Vector Machine (MSVM) approach, in which a second MSVM predictor is placed at the output of the first-stage MSVM to capture the contextual relationships among secondary structure elements and thereby minimise the generalisation error of the prediction. Using position-specific scoring matrices generated by PSI-BLAST, the two-stage MSVM approach achieves Q3 accuracies of 78.0% and 76.3% on the RS126 dataset of 126 non-homologous globular proteins and the CB396 dataset of 396 non-homologous proteins, respectively, which are better than the scores reported on both datasets to date. With MSVMs, the present prediction scheme achieves significant improvements of 2-6% in Q3 accuracy and 3-15% in Sov accuracy on the two datasets. On larger blind-test datasets from PSIPRED, CASP4 and EVA, the two-stage MSVM approach achieves Q3 accuracies from 77.0% to 79.5%.
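
As a rough illustration of two ideas mentioned in the abstract, the sketch below computes a Q3 score (the percentage of residues whose three-state label is predicted correctly) and builds second-stage input vectors by concatenating first-stage outputs over a window of neighbouring residues. The function names, the H/E/C label alphabet, the window half-width and the zero padding at the sequence ends are assumptions made for this example; the abstract does not specify how the authors encode the second-stage input.

```python
from typing import List, Sequence

# Assumed three-state alphabet: helix (H), strand (E), coil (C).
STATES = "HEC"


def q3_accuracy(predicted: str, observed: str) -> float:
    """Return the percentage of residues whose state is predicted correctly."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("sequences must not be empty")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


def second_stage_windows(
    stage1_scores: Sequence[Sequence[float]], half_width: int
) -> List[List[float]]:
    """Concatenate first-stage outputs of neighbouring residues.

    Each residue gets one vector of length (2 * half_width + 1) * len(STATES).
    Positions beyond the sequence ends are zero-padded (an assumption).
    """
    pad = [0.0] * len(STATES)
    n = len(stage1_scores)
    windows = []
    for i in range(n):
        features: List[float] = []
        for j in range(i - half_width, i + half_width + 1):
            features.extend(stage1_scores[j] if 0 <= j < n else pad)
        windows.append(features)
    return windows


if __name__ == "__main__":
    print(q3_accuracy("HHEC", "HHCC"))
    toy_scores = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    for row in second_stage_windows(toy_scores, half_width=1):
        print(row)
```

In a setup like this, the vectors returned by `second_stage_windows` would be passed to a second multi-class classifier, so that the final label of a residue can depend on the first-stage predictions of its neighbours.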
