Department of Computer Science, USTO-MB University, BP 1505 El Mnaouer, Oran, Algeria.
Evol Bioinform Online. 2011;7:171-89. doi: 10.4137/EBO.S7931. Epub 2011 Oct 10.
Machine learning techniques have been widely applied to the problem of predicting protein secondary structure from the amino acid sequence, and they have achieved substantial success in this research area. Many methods have been used, including k-Nearest Neighbors (k-NNs), Hidden Markov Models (HMMs), Artificial Neural Networks (ANNs) and Support Vector Machines (SVMs), the last of which have attracted attention recently. Today, the main goal remains to improve the prediction quality of the secondary structure elements. Prediction accuracy has improved continuously over the years, especially through the use of hybrid or ensemble methods and the incorporation of evolutionary information in the form of profiles extracted from alignments of multiple homologous sequences. In this paper, we investigate how best to combine k-NNs, ANNs and Multi-class SVMs (M-SVMs) to improve secondary structure prediction of globular proteins. We apply an ensemble method that combines the outputs of two feed-forward ANNs, a k-NN classifier and three M-SVM classifiers. Ensemble members are combined using two variants of the majority voting rule. A heuristic-based filter is also applied to refine the prediction. To measure how much the ensemble improves on the individual classifiers that make it up, we evaluated the proposed system on the two widely used benchmark datasets RS126 and CB513 using cross-validation, with PSI-BLAST position-specific scoring matrix (PSSM) profiles as inputs. The experimental results reveal that the proposed system yields significant performance gains when compared with the best individual classifier.
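The combination step can be illustrated with a minimal sketch of per-residue majority voting over the outputs of several classifiers. The abstract does not say which two variants of the voting rule are used, so the unweighted and weighted forms below are assumptions chosen for illustration. The function names, the three-state alphabet `H`/`E`/`C`, the tie-breaking order and the weights are likewise hypothetical and are not taken from the paper.

```python
from collections import Counter

# Assumed three-state alphabet: helix, strand, coil.
# The order also serves as the (arbitrary) tie-breaking priority.
STATES = ("H", "E", "C")


def _check_lengths(predictions):
    """Ensure every classifier predicted the same number of residues."""
    if not predictions:
        raise ValueError("at least one prediction is required")
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all predictions must have the same length")


def simple_majority_vote(predictions):
    """Return, for each residue, the state predicted by the most classifiers.

    predictions: list of equal-length strings, one string per classifier.
    Ties are resolved in favour of the state listed first in STATES.
    """
    _check_lengths(predictions)
    consensus = []
    for column in zip(*predictions):  # one column per residue
        counts = Counter(column)
        consensus.append(max(STATES, key=lambda s: counts[s]))
    return "".join(consensus)


def weighted_majority_vote(predictions, weights):
    """Like simple_majority_vote, but each classifier's vote counts
    with its own weight (for example, its validation accuracy)."""
    _check_lengths(predictions)
    if len(weights) != len(predictions):
        raise ValueError("one weight per classifier is required")
    consensus = []
    for column in zip(*predictions):
        scores = dict.fromkeys(STATES, 0.0)
        for state, weight in zip(column, weights):
            scores[state] += weight
        consensus.append(max(STATES, key=lambda s: scores[s]))
    return "".join(consensus)


# Toy outputs of six hypothetical ensemble members for a four-residue segment.
example_predictions = ["HHEC", "HHEC", "HEEC", "CHEE", "HHCC", "EHEC"]
example_consensus = simple_majority_vote(example_predictions)
```

With equal weights the two functions agree. Giving one member a much larger weight lets it override the others, which is the usual motivation for weighting votes by a per-classifier reliability estimate.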