Pradhan Debasmita, Padhy Sudarsan, Sahoo Biswajit
Department of Computer Science and Engineering, Silicon Institute of Technology, Silicon Hills, Patia, Bhubaneswar, 751024, India.
Comput Biol Chem. 2017 Oct;70:211-219. doi: 10.1016/j.compbiolchem.2017.08.009. Epub 2017 Aug 31.
Proteins are the macromolecules responsible for almost all biological processes in a cell. With a large number of protein sequences now available from different sequencing projects, the challenge for scientists is to characterize their functions. Because wet-lab methods are time-consuming and expensive, many computational methods have been proposed, such as FASTA, PSI-BLAST, DNA microarray clustering, and Nearest Neighborhood classification on protein-protein interaction networks. The support vector machine (SVM) is one such method; it has been used successfully for several problems, including protein fold recognition and protein structure prediction. In 2003, Cai et al. used SVM to classify proteins into different functional classes and to predict their function, representing the protein sequences by their physico-chemical properties. This paper proposes a model comprising feature subset selection followed by a multiclass SVM to determine the functional class of a newly generated protein sequence. To train and test the model, 32 physico-chemical properties of enzymes from 6 enzyme classes are considered. To determine which features contribute significantly to functional classification, the Sequential Forward Floating Selection (SFFS), Orthogonal Forward Selection (OFS), and SVM Recursive Feature Elimination (SVM-RFE) algorithms are used. Of the 32 properties considered initially, only 20 features are sufficient to classify the proteins into their functional classes, with accuracy ranging from 91% to 94%. In comparison, OFS followed by SVM performs better than the other methods. Our model generalizes the existing model to include multiclass classification and to identify the most significant features affecting protein function.
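The "feature subset selection followed by a classifier" idea in the abstract can be illustrated with a minimal sketch of Sequential Forward Floating Selection. This is not the authors' implementation. To keep it runnable with only the Python standard library, a nearest-centroid classifier stands in for the multiclass SVM, and the data is a hand-made toy set with three classes and four features instead of six enzyme classes and 32 properties. All names (`sffs`, `accuracy`, `centroids`, `X_TRAIN`, and so on) are hypothetical.

```python
from collections import defaultdict


def centroids(rows, labels, feats):
    """Per-class mean of the selected feature columns."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[label].append([row[f] for f in feats])
    return {c: [sum(col) / len(col) for col in zip(*pts)]
            for c, pts in groups.items()}


def accuracy(train, test, feats):
    """Held-out accuracy of a nearest-centroid classifier (SVM stand-in)."""
    (x_tr, y_tr), (x_te, y_te) = train, test
    cents = centroids(x_tr, y_tr, feats)
    hits = 0
    for row, label in zip(x_te, y_te):
        point = [row[f] for f in feats]
        pred = min(cents, key=lambda c: sum((p - q) ** 2
                                            for p, q in zip(point, cents[c])))
        hits += pred == label
    return hits / len(y_te)


def sffs(score, n_features, k):
    """Return (score, subset) for the best subset of size k found by SFFS."""
    selected, best = [], {}  # best maps subset size -> (score, subset)
    while len(selected) < k:
        # Forward step: add the single feature that raises the score most.
        rest = [f for f in range(n_features) if f not in selected]
        selected = selected + [max(rest, key=lambda f: score(selected + [f]))]
        s = score(selected)
        if s > best.get(len(selected), (-1.0, None))[0]:
            best[len(selected)] = (s, selected)
        # Floating step: drop a feature only while the reduced subset beats
        # the best subset of that size seen so far.
        while len(selected) > 2:
            drop = max(selected,
                       key=lambda f: score([g for g in selected if g != f]))
            reduced = [g for g in selected if g != drop]
            s_red = score(reduced)
            if s_red <= best[len(reduced)][0]:
                break
            selected = reduced
            best[len(reduced)] = (s_red, reduced)
    return best[k]


# Toy data (an assumption for illustration): columns 0 and 2 carry class
# information, columns 1 and 3 have identical class means and act as noise.
X_TRAIN = [[0, 2, 0, 8], [2, 8, 10, 2], [10, 2, 0, 8],
           [8, 8, 2, 2], [10, 2, 10, 2], [8, 8, 8, 8]]
Y_TRAIN = ["A", "A", "B", "B", "C", "C"]
X_TEST = [[1, 9, 1, 1], [1, 1, 9, 9], [9, 1, 1, 9],
          [9, 9, 2, 3], [9, 6, 9, 3], [8, 4, 8, 7]]
Y_TEST = ["A", "A", "B", "B", "C", "C"]

best_score, best_subset = sffs(
    lambda feats: accuracy((X_TRAIN, Y_TRAIN), (X_TEST, Y_TEST), feats),
    n_features=4, k=2)
print(best_subset, best_score)
```

On this toy data the two informative columns are the ones selected. In a setting closer to the paper, the `score` callback would presumably wrap cross-validated accuracy of a multiclass SVM, and `k` would be varied to find the smallest subset that keeps accuracy acceptable.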