Protein classification with imbalanced data.

Authors

Zhao Xing-Ming, Li Xin, Chen Luonan, Aihara Kazuyuki

Affiliation

ERATO Aihara Complexity Modelling Project, JST, Tokyo 151-0064, Japan.

Publication information

Proteins. 2008 Mar;70(4):1125-32. doi: 10.1002/prot.21870.

Abstract

Protein classification is generally a multi-class problem that can be reduced to a set of binary classification problems, with one classifier designed for each class. The proteins in a class are treated as positive examples and those outside it as negative examples. This reduction introduces class imbalance, because the number of proteins in one class is usually much smaller than the number outside it. As a result, classifiers tend to overfit and to perform poorly, particularly on the minority class. This article presents a new technique for protein classification with imbalanced data. First, we propose a new algorithm that addresses the imbalance with a new sampling technique and a committee of classifiers. Then, classifiers trained in different feature spaces are combined to further improve classification accuracy. Numerical experiments on benchmark datasets show promising results, which confirm the effectiveness of the proposed method in terms of accuracy. The Matlab code and supplementary materials are available at http://eserver2.sat.iis.u-tokyo.ac.jp/~xmzhao/proteins.html.
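The one-vs-rest reduction and the idea of a committee trained on rebalanced samples can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the abstract does not describe their sampling technique, so plain random undersampling of the negative class stands in for it, and a nearest-centroid rule stands in for the base classifier. The toy data and all function names are hypothetical.

```python
import random

def one_vs_rest_labels(labels, positive_class):
    """Binary relabelling: 1 for the chosen class, 0 for every other class."""
    return [1 if y == positive_class else 0 for y in labels]

def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def train_committee(X, y, n_members=5, seed=0):
    """Train n_members nearest-centroid classifiers.

    Each member sees all positives plus an equally sized random sample of
    negatives (random undersampling, an assumed stand-in for the paper's
    sampling technique), so every member trains on balanced data.
    """
    rng = random.Random(seed)
    pos = [x for x, t in zip(X, y) if t == 1]
    neg = [x for x, t in zip(X, y) if t == 0]
    committee = []
    for _ in range(n_members):
        neg_sample = rng.sample(neg, min(len(pos), len(neg)))
        committee.append((centroid(pos), centroid(neg_sample)))
    return committee

def predict(committee, x):
    """Majority vote: a member votes positive if x is closer to its positive centroid."""
    votes = sum(1 for c_pos, c_neg in committee
                if sq_dist(x, c_pos) < sq_dist(x, c_neg))
    return 1 if 2 * votes > len(committee) else 0

# Hypothetical toy data: class "a" is small, classes "b" and "c" are large.
X = ([[0.1 * i, 0.0] for i in range(5)]
     + [[5 + 0.05 * i, 1.0] for i in range(20)]
     + [[7 + 0.05 * i, -1.0] for i in range(20)])
labels = ["a"] * 5 + ["b"] * 20 + ["c"] * 20

# One-vs-rest for class "a": 5 positives against 40 negatives.
y_a = one_vs_rest_labels(labels, "a")
committee = train_committee(X, y_a)

pred_a = predict(committee, [0.2, 0.0])   # a point inside class "a"
pred_b = predict(committee, [5.5, 1.0])   # a point inside class "b"
```

In a full multi-class setting, one such committee would be trained per class. How the paper combines committee outputs, and how it merges classifiers trained in different feature spaces, is not specified in the abstract, so the majority vote here is only one plausible choice.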
