Prediction of protein subcellular localization with oversampling approach and Chou's general PseAAC.

Authors

Zhang Shengli, Duan Xin

Affiliation

School of Mathematics and Statistics, Xidian University, Xi'an 710071, China.

Publication information

J Theor Biol. 2018 Jan 21;437:239-250. doi: 10.1016/j.jtbi.2017.10.030. Epub 2017 Oct 31.

DOI:10.1016/j.jtbi.2017.10.030

PMID:29100918

Abstract

Predicting protein subcellular location with support vector machines (SVMs) has recently become a popular research area because of the explosive growth of biological data. Although substantial progress has been made, few researchers have addressed data imbalance before classification, which leads to low accuracy for some categories. In this work, we combine an oversampling method with an SVM to predict protein subcellular localization on unbalanced data sets. To capture valuable information about a protein, a PseAAC (pseudo amino acid composition) feature vector is extracted from the PSSM (position-specific scoring matrix), and its features are then selected by principal component analysis (PCA). Next, the samples are oversampled to eliminate the imbalance in sample numbers across classes and fed into the SVM to predict the protein subcellular location. To evaluate the proposed method, jackknife tests are performed on three benchmark datasets (ZD98, CL317 and ZW225). Jackknife results for the SVM with and without oversampling show that oversampling successfully reduces the imbalance of the data sets, and the prediction accuracy of every class in every dataset is higher than 88.9%. Compared with other protein subcellular localization methods, the method in this work achieves the best performance. The overall accuracies on ZD98, CL317 and ZW225 are 93.2%, 96.00% and 92.15%, respectively, which are 2.4%, 8.0% and 8.2% higher than the best methods in the comparison. The high overall accuracy indicates that the feature representation and selection capture useful information from the protein sequence, and that oversampling successfully addresses the imbalance of sample numbers in SVM classification.
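The evaluation loop described in the abstract (oversample the minority classes, classify, and score with a jackknife test) can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not say which oversampling variant was used, so plain random duplication is assumed here; a nearest-centroid rule stands in for the SVM so the example needs no third-party packages; the PSSM/PseAAC feature extraction and PCA steps are omitted; and the function names and toy data are hypothetical.

```python
import random
from collections import Counter, defaultdict


def random_oversample(X, y, rng):
    """Duplicate samples at random until every class matches the largest class.

    Assumption: simple random duplication; the abstract does not name the variant.
    """
    by_class = defaultdict(list)
    for features, label in zip(X, y):
        by_class[label].append(features)
    target = max(len(rows) for rows in by_class.values())
    X_bal, y_bal = [], []
    for label, rows in by_class.items():
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        for features in rows + extra:
            X_bal.append(features)
            y_bal.append(label)
    return X_bal, y_bal


def nearest_centroid_predict(X_train, y_train, x):
    """Stand-in classifier (NOT an SVM): choose the class whose mean is closest to x."""
    counts = Counter(y_train)
    sums = {}
    for features, label in zip(X_train, y_train):
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value

    def squared_distance(label):
        return sum((s / counts[label] - xi) ** 2 for s, xi in zip(sums[label], x))

    return min(sums, key=squared_distance)


def jackknife_accuracy(X, y, seed=0):
    """Leave-one-out (jackknife) test; oversampling is applied to the training part only."""
    rng = random.Random(seed)
    correct, total = Counter(), Counter(y)
    for i in range(len(X)):
        X_train, y_train = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        X_bal, y_bal = random_oversample(X_train, y_train, rng)
        if nearest_centroid_predict(X_bal, y_bal, X[i]) == y[i]:
            correct[y[i]] += 1
    per_class = {label: correct[label] / total[label] for label in total}
    overall = sum(correct.values()) / len(y)
    return per_class, overall


# Hypothetical, imbalanced toy data: six "cyto" samples and three "nucl" samples.
X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [0.3, 0.2], [0.0, 0.4], [0.4, 0.1],
     [5.0, 5.1], [5.2, 4.9], [4.8, 5.0]]
y = ["cyto"] * 6 + ["nucl"] * 3

per_class, overall = jackknife_accuracy(X, y)
print(per_class, overall)
```

Reporting per-class accuracy alongside overall accuracy matters on unbalanced data, because a high overall score can hide a poorly predicted minority class. Oversampling inside each jackknife fold, rather than once on the full data set, is a design choice of this sketch that keeps copies of the held-out sample out of the training data; the abstract does not state which order the authors used.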

Similar articles

Predicting apoptosis protein subcellular localization by integrating auto-cross correlation and PSSM into Chou's PseAAC.

J Theor Biol. 2018 Nov 14;457:163-169. doi: 10.1016/j.jtbi.2018.08.042. Epub 2018 Sep 1.

Prediction of Protein Subcellular Localization Based on Fusion of Multi-view Features.

Molecules. 2019 Mar 6;24(5):919. doi: 10.3390/molecules24050919.

Identification of protein subcellular localization via integrating evolutionary and physicochemical information into Chou's general PseAAC.

J Theor Biol. 2019 Feb 7;462:230-239. doi: 10.1016/j.jtbi.2018.11.012. Epub 2018 Nov 16.

iAPSL-IF: Identification of Apoptosis Protein Subcellular Location Using Integrative Features Captured from Amino Acid Sequences.

Int J Mol Sci. 2018 Apr 13;19(4):1190. doi: 10.3390/ijms19041190.

Prediction of Protein Submitochondrial Locations by Incorporating Dipeptide Composition into Chou's General Pseudo Amino Acid Composition.

J Membr Biol. 2016 Jun;249(3):293-304. doi: 10.1007/s00232-015-9868-8. Epub 2016 Jan 8.

Prediction of apoptosis protein subcellular location based on position-specific scoring matrix and isometric mapping algorithm.

Med Biol Eng Comput. 2019 Dec;57(12):2553-2565. doi: 10.1007/s11517-019-02045-3. Epub 2019 Oct 16.

Prediction of Apoptosis Protein's Subcellular Localization by Fusing Two Different Descriptors Based on Evolutionary Information.

Acta Biotheor. 2018 Mar;66(1):61-78. doi: 10.1007/s10441-018-9319-x. Epub 2018 Mar 12.

Predict protein structural class by incorporating two different modes of evolutionary information into Chou's general pseudo amino acid composition.

J Mol Graph Model. 2017 Nov;78:110-117. doi: 10.1016/j.jmgm.2017.10.003. Epub 2017 Oct 7.

DPP-PseAAC: A DNA-binding protein prediction model using Chou's general PseAAC.

J Theor Biol. 2018 Sep 7;452:22-34. doi: 10.1016/j.jtbi.2018.05.006. Epub 2018 May 16.

Cited by

BERT-DomainAFP: Antifreeze protein recognition and classification model based on BERT and structural domain annotation.

iScience. 2025 Mar 6;28(4):112077. doi: 10.1016/j.isci.2025.112077. eCollection 2025 Apr 18.

DNAPred_Prot: Identification of DNA-Binding Proteins Using Composition- and Position-Based Features.

Appl Bionics Biomech. 2022 Apr 13;2022:5483115. doi: 10.1155/2022/5483115. eCollection 2022.

Multiple Protein Subcellular Locations Prediction Based on Deep Convolutional Neural Networks with Self-Attention Mechanism.

Interdiscip Sci. 2022 Jun;14(2):421-438. doi: 10.1007/s12539-021-00496-7. Epub 2022 Jan 23.

Augmented sequence features and subcellular localization for functional characterization of unknown protein sequences.

Med Biol Eng Comput. 2021 Nov;59(11-12):2297-2310. doi: 10.1007/s11517-021-02436-5. Epub 2021 Sep 20.

Machine and Deep Learning for Prediction of Subcellular Localization.

Methods Mol Biol. 2021;2361:249-261. doi: 10.1007/978-1-0716-1641-3_15.

Ensemble of Multiple Classifiers for Multilabel Classification of Plant Protein Subcellular Localization.

Life (Basel). 2021 Mar 30;11(4):293. doi: 10.3390/life11040293.

Variable selection from a feature representing protein sequences: a case of classification on bacterial type IV secreted effectors.

BMC Bioinformatics. 2020 Oct 27;21(1):480. doi: 10.1186/s12859-020-03826-6.

Self-evoluting framework of deep convolutional neural network for multilocus protein subcellular localization.

Med Biol Eng Comput. 2020 Dec;58(12):3017-3038. doi: 10.1007/s11517-020-02275-w. Epub 2020 Oct 20.

DeepPred-SubMito: A Novel Submitochondrial Localization Predictor Based on Multi-Channel Convolutional Neural Network and Dataset Balancing Treatment.

Int J Mol Sci. 2020 Aug 9;21(16):5710. doi: 10.3390/ijms21165710.

Subcellular location prediction of apoptosis proteins using two novel feature extraction methods based on evolutionary information and LDA.

BMC Bioinformatics. 2020 May 24;21(1):212. doi: 10.1186/s12859-020-3539-1.
