Protein fold classification with genetic algorithms and feature selection.

Author information

Chen Peng, Liu Chunmei, Burge Legand, Mahmood Mohammad, Southerland William, Gloster Clay

Affiliation

Department of Systems and Computer Science, Howard University, 2300 Sixth Street, NW, Washington, DC 20059, USA.

Publication information

J Bioinform Comput Biol. 2009 Oct;7(5):773-88. doi: 10.1142/s0219720009004321.

DOI:10.1142/s0219720009004321

PMID:19785045

Abstract

Protein fold classification is a key step in predicting protein tertiary structures. This paper proposes a novel approach to classifying protein folds based on genetic algorithms and feature selection. Our dataset is divided into a training dataset and a test dataset. Each individual in the genetic algorithms represents a selection function over the feature vectors of the training dataset. A support vector machine is applied to each individual to evaluate its fitness value (the fold classification rate). The aim of the genetic algorithms is to search for the best individual, that is, the one that produces the highest fold classification rate. The best individual is then applied to the feature vectors of the test dataset, and a support vector machine is built to classify protein folds based on the selected features. Our experimental results on Ding and Dubchak's benchmark dataset of 27-class folds show that our approach achieves an accuracy of 71.28%, which outperforms current state-of-the-art protein fold predictors.
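
The loop described in the abstract (individuals as feature-selection masks, a classifier's accuracy as fitness, and the best mask applied to the test set) might be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. A nearest-centroid classifier stands in for the support vector machine so that the sketch needs only the Python standard library. The data are synthetic. The split of the training data into a fitting part and a validation part, the operators (tournament selection, one-point crossover, bit-flip mutation, elitism) and all names and parameter values are hypothetical choices that the abstract does not specify.

```python
import random


def nearest_centroid_accuracy(train_X, train_y, eval_X, eval_y, mask):
    """Accuracy of a nearest-centroid classifier using only the masked features.

    Assumption: this simple classifier is a stand-in for the SVM in the abstract.
    """
    cols = [j for j, keep in enumerate(mask) if keep]
    if not cols:
        return 0.0  # an empty feature subset cannot classify anything
    sums, counts = {}, {}
    for x, y in zip(train_X, train_y):
        acc = sums.setdefault(y, [0.0] * len(cols))
        for k, j in enumerate(cols):
            acc[k] += x[j]
        counts[y] = counts.get(y, 0) + 1
    centroids = {y: [s / counts[y] for s in sums[y]] for y in sums}
    correct = 0
    for x, y in zip(eval_X, eval_y):
        pred = min(
            centroids,
            key=lambda c: sum((x[j] - centroids[c][k]) ** 2 for k, j in enumerate(cols)),
        )
        correct += pred == y
    return correct / len(eval_y)


def ga_select_features(n_features, fitness, pop_size=20, generations=30,
                       mutation_rate=0.05, seed=0):
    """Search for the boolean feature mask with the highest fitness.

    Requires n_features >= 2 because of the one-point crossover.
    """
    rng = random.Random(seed)
    population = [[rng.random() < 0.5 for _ in range(n_features)]
                  for _ in range(pop_size)]
    best, best_fit = None, -1.0
    for _ in range(generations):
        scored = [(fitness(ind), ind) for ind in population]
        top_fit, top = max(scored, key=lambda t: t[0])
        if top_fit > best_fit:
            best, best_fit = top[:], top_fit
        next_pop = [best[:]]  # elitism: the best individual always survives
        while len(next_pop) < pop_size:
            # Tournament selection of size 2 for each parent.
            p1 = max(rng.sample(scored, 2), key=lambda t: t[0])[1]
            p2 = max(rng.sample(scored, 2), key=lambda t: t[0])[1]
            cut = rng.randrange(1, n_features)  # one-point crossover
            child = p1[:cut] + p2[cut:]
            # Bit-flip mutation.
            child = [(not g) if rng.random() < mutation_rate else g for g in child]
            next_pop.append(child)
        population = next_pop
    return best, best_fit


def make_data(n, rng):
    """Synthetic 3-class data: features 0 and 1 are informative, 2..5 are noise."""
    X, y = [], []
    for i in range(n):
        label = i % 3
        row = [label + rng.gauss(0, 0.3), -label + rng.gauss(0, 0.3)]
        row += [rng.gauss(0, 1) for _ in range(4)]
        X.append(row)
        y.append(label)
    return X, y


data_rng = random.Random(42)
fit_X, fit_y = make_data(60, data_rng)    # training data used to fit the classifier
val_X, val_y = make_data(60, data_rng)    # training data held out to score fitness
test_X, test_y = make_data(60, data_rng)  # test data, used only once at the end


def fitness(mask):
    return nearest_centroid_accuracy(fit_X, fit_y, val_X, val_y, mask)


best_mask, best_fit = ga_select_features(6, fitness)
test_acc = nearest_centroid_accuracy(fit_X, fit_y, test_X, test_y, best_mask)
print("selected features:", [j for j, keep in enumerate(best_mask) if keep])
print("validation fitness:", best_fit, "test accuracy:", test_acc)
```

Scoring fitness on data separate from the test set keeps the final accuracy an honest estimate. To move closer to the setup in the abstract, the body of `fitness` could be replaced with the classification rate of an SVM trained on the masked features.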
