

Prediction of protein subcellular localization with oversampling approach and Chou's general PseAAC.

Author information

Zhang Shengli, Duan Xin

Affiliations

School of Mathematics and Statistics, Xidian University, Xi'an 710071, China.


Publication information

J Theor Biol. 2018 Jan 21;437:239-250. doi: 10.1016/j.jtbi.2017.10.030. Epub 2017 Oct 31.

Abstract

Predicting protein subcellular location with support vector machines (SVMs) has recently become a popular research area because of the rapid growth of biological data. Although substantial progress has been made, few researchers have considered the problem of data imbalance before classification, which leads to low accuracy for some categories. In this work, we therefore combine an oversampling method with SVM to handle protein subcellular localization on imbalanced data sets. To capture valuable information about a protein, a PseAAC (Pseudo Amino Acid Composition) feature vector is extracted from the PSSM (Position-Specific Scoring Matrix), and its features are then selected by principal component analysis (PCA). Next, the samples are processed by the oversampling method to eliminate the imbalance in sample numbers across classes and are fed into a support vector machine to predict protein subcellular location. To evaluate the performance of the proposed method, jackknife tests are performed on three benchmark datasets (ZD98, CL317 and ZW225). The jackknife results of SVM experiments with and without oversampling show that oversampling successfully reduces the imbalance of the data sets, and the prediction accuracy of each class in each dataset is higher than 88.9%. Compared with other protein subcellular localization methods, the method in this work achieves the best performance. The overall accuracies on ZD98, CL317 and ZW225 are 93.2%, 96.00% and 92.15%, respectively, which are 2.4%, 8.0% and 8.2% higher than the best methods in the comparison. The excellent overall accuracy of the proposed method indicates that the feature representation and selection capture useful information from the protein sequence, and that oversampling successfully addresses the imbalance of sample numbers in SVM classification.
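The oversampling and jackknife steps described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation, and several things in it are assumptions: the abstract does not say which oversampling method was used, so plain random duplication of minority samples is shown; a small k-nearest-neighbour vote stands in for the SVM; the feature vectors are toy numbers rather than PSSM-derived PseAAC features; and the PCA step is omitted. Oversampling only the training part of each jackknife round is also a choice made here, not something the abstract confirms. All function names are hypothetical.

```python
import random
from collections import Counter


def random_oversample(X, y, rng):
    """Duplicate randomly chosen minority samples until all classes match the largest one.

    Assumption: simple random duplication; the abstract does not name the method used.
    """
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(pool))
            y_out.append(label)
    return X_out, y_out


def predict_knn(X_train, y_train, x, k=3):
    """Stand-in classifier (k-nearest-neighbour majority vote); NOT an SVM."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    # Indices of the k training samples closest to x.
    nearest = sorted(range(len(X_train)), key=lambda i: sq_dist(X_train[i], x))[:k]
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]


def jackknife_per_class_accuracy(X, y, oversample=True, seed=0):
    """Leave-one-out (jackknife) test that reports accuracy separately for each class."""
    rng = random.Random(seed)
    total = Counter(y)
    correct = Counter()
    for i in range(len(X)):
        # Hold out sample i and train on everything else.
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        if oversample:
            # Balance only the training samples; the held-out sample is never copied.
            X_train, y_train = random_oversample(X_train, y_train, rng)
        if predict_knn(X_train, y_train, X[i]) == y[i]:
            correct[y[i]] += 1
    return {label: correct[label] / total[label] for label in total}


# Toy, hypothetical data: one majority class and one minority class.
toy_X = [[0.0], [1.0], [2.0], [3.0], [4.0], [9.0], [10.0]]
toy_y = ["majority"] * 5 + ["minority"] * 2

print(jackknife_per_class_accuracy(toy_X, toy_y, oversample=False))
print(jackknife_per_class_accuracy(toy_X, toy_y, oversample=True))
```

Reporting accuracy per class rather than only overall is what makes the effect of imbalance visible, which matches the abstract's emphasis on per-class accuracy. With a real SVM and real PseAAC features the numbers would of course differ from this toy run.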

