Protein classification with imbalanced data.

Authors

Zhao Xing-Ming, Li Xin, Chen Luonan, Aihara Kazuyuki

Affiliation

ERATO Aihara Complexity Modelling Project, JST, Tokyo 151-0064, Japan.

Publication information

Proteins. 2008 Mar;70(4):1125-32. doi: 10.1002/prot.21870.

DOI:10.1002/prot.21870

PMID:18076026

Abstract

Protein classification is generally a multi-class problem that can be reduced to a set of binary classification problems, with one classifier designed for each class. The proteins in a class are treated as positive examples and those outside it as negative examples. This reduction creates a class imbalance problem, because the number of proteins in one class is usually much smaller than the number outside it. As a result, classifiers tend to overfit and to perform poorly, particularly on the minority class. This article presents a new technique for protein classification with imbalanced data. First, we propose a new algorithm that overcomes the imbalance problem with a new sampling technique and a committee of classifiers. Then, classifiers trained in different feature spaces are combined to further improve classification accuracy. Numerical experiments on benchmark datasets show promising results, which confirm the effectiveness of the proposed method in terms of accuracy. The Matlab code and supplementary materials are available at http://eserver2.sat.iis.u-tokyo.ac.jp/~xmzhao/proteins.html.
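
The one-vs-rest reduction and the committee idea described in the abstract can be illustrated with a small sketch. This is not the authors' algorithm: the abstract does not specify the sampling technique or the base classifier, so the sketch assumes plain random undersampling of the negative class and a nearest-centroid base learner, and it omits the combination of classifiers across feature spaces. All function names are hypothetical.

```python
import random


def centroid(rows):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train_member(pos, neg):
    """Assumed base learner: store one centroid per side (nearest-centroid rule)."""
    return centroid(pos), centroid(neg)


def member_score(member, x):
    """Normalized margin in [-1, 1]; positive means x is closer to the positive centroid."""
    c_pos, c_neg = member
    d_pos, d_neg = sq_dist(x, c_pos), sq_dist(x, c_neg)
    total = d_pos + d_neg
    return 0.0 if total == 0 else (d_neg - d_pos) / total


def train_committee(pos, neg, n_members=5, seed=0):
    """Train each member on all positives plus an equally sized random draw of negatives.

    Random undersampling is an assumption; the paper's sampling technique may differ.
    """
    rng = random.Random(seed)
    k = min(len(pos), len(neg))
    return [train_member(pos, rng.sample(neg, k)) for _ in range(n_members)]


def committee_score(committee, x):
    """Average the member scores so that no single negative sample dominates."""
    return sum(member_score(m, x) for m in committee) / len(committee)


def train_one_vs_rest(X, y, n_members=5, seed=0):
    """Build one committee per class: that class is positive, all others negative."""
    models = {}
    for label in sorted(set(y)):
        pos = [x for x, t in zip(X, y) if t == label]
        neg = [x for x, t in zip(X, y) if t != label]
        models[label] = train_committee(pos, neg, n_members, seed)
    return models


def predict(models, x):
    """Return the class whose committee gives x the highest average score."""
    return max(sorted(models), key=lambda label: committee_score(models[label], x))
```

Each committee member sees a balanced training set, which is one common way to keep the minority (positive) class from being swamped by the much larger negative class. Averaging over several random draws reduces the information lost by discarding negatives in any single draw.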

Similar articles

Classification and knowledge discovery in protein databases.

J Biomed Inform. 2004 Aug;37(4):224-39. doi: 10.1016/j.jbi.2004.07.008.

A Protein Classification Benchmark collection for machine learning.

Nucleic Acids Res. 2007 Jan;35(Database issue):D232-6. doi: 10.1093/nar/gkl812. Epub 2006 Nov 16.

Variable predictive model based classification algorithm for effective separation of protein structural classes.

Comput Biol Chem. 2008 Aug;32(4):302-6. doi: 10.1016/j.compbiolchem.2008.03.009. Epub 2008 Apr 1.

Evolutionary undersampling for classification with imbalanced datasets: proposals and taxonomy.

Evol Comput. 2009 Fall;17(3):275-306. doi: 10.1162/evco.2009.17.3.275.

New support vector-based design method for binary hierarchical classifiers for multi-class classification problems.

Neural Netw. 2008 Mar-Apr;21(2-3):502-10. doi: 10.1016/j.neunet.2007.12.005. Epub 2007 Dec 8.

Class-imbalanced classifiers for high-dimensional data.

Brief Bioinform. 2013 Jan;14(1):13-26. doi: 10.1093/bib/bbs006. Epub 2012 Mar 9.

Multi-class protein fold classification using a new ensemble machine learning approach.

Genome Inform. 2003;14:206-17.

Benchmarking protein classification algorithms via supervised cross-validation.

J Biochem Biophys Methods. 2008 Apr 24;70(6):1215-23. doi: 10.1016/j.jbbm.2007.05.011. Epub 2007 May 31.

Accuracy-based learning classifier systems: models, analysis and applications to classification tasks.

Evol Comput. 2003 Fall;11(3):209-38. doi: 10.1162/106365603322365289.

Cited by

Exploring the Potential of GANs in Biological Sequence Analysis.

Biology (Basel). 2023 Jun 14;12(6):854. doi: 10.3390/biology12060854.

Whole-Tumor ADC Texture Analysis Is Able to Predict Breast Cancer Receptor Status.

Diagnostics (Basel). 2023 Apr 14;13(8):1414. doi: 10.3390/diagnostics13081414.

PASSer: fast and accurate prediction of protein allosteric sites.

Nucleic Acids Res. 2023 Jul 5;51(W1):W427-W431. doi: 10.1093/nar/gkad303.

Machine learning to improve the interpretation of intercalating dye-based quantitative PCR results.

Sci Rep. 2022 Sep 30;12(1):16445. doi: 10.1038/s41598-022-21010-z.

PASSer: Prediction of Allosteric Sites Server.

Mach Learn Sci Technol. 2021 Sep;2(3). doi: 10.1088/2632-2153/abe6d6. Epub 2021 May 13.

DeepCAPE: A Deep Convolutional Neural Network for the Accurate Prediction of Enhancers.

Genomics Proteomics Bioinformatics. 2021 Aug;19(4):565-577. doi: 10.1016/j.gpb.2019.04.006. Epub 2021 Feb 11.

Decoy selection for protein structure prediction via extreme gradient boosting and ranking.

BMC Bioinformatics. 2020 Dec 9;21(Suppl 1):189. doi: 10.1186/s12859-020-3523-9.

Chronic Kidney Disease stratification using office visit records: Handling data imbalance via hierarchical meta-classification.

BMC Med Inform Decis Mak. 2018 Dec 12;18(Suppl 4):125. doi: 10.1186/s12911-018-0675-x.

Machine learning in computational biology to accelerate high-throughput protein expression.

Bioinformatics. 2017 Aug 15;33(16):2487-2495. doi: 10.1093/bioinformatics/btx207.

iDPF-PseRAAAC: A Web-Server for Identifying the Defensin Peptide Family and Subfamily Using Pseudo Reduced Amino Acid Alphabet Composition.

PLoS One. 2015 Dec 29;10(12):e0145541. doi: 10.1371/journal.pone.0145541. eCollection 2015.
