Zhao Xing-Ming, Li Xin, Chen Luonan, Aihara Kazuyuki
ERATO Aihara Complexity Modelling Project, JST, Tokyo 151-0064, Japan.
Proteins. 2008 Mar;70(4):1125-32. doi: 10.1002/prot.21870.
Protein classification is generally a multi-class problem that can be reduced to a set of binary classification problems, with one classifier designed for each class. Proteins in the class are treated as positive examples and those outside it as negative examples. This reduction creates a class-imbalance problem, because the number of proteins in one class is usually much smaller than the number outside it. As a result, classifiers tend to overfit and to perform poorly, particularly on the minority class. This article presents a new technique for protein classification with imbalanced data. First, we propose a new algorithm that addresses the imbalance with a new sampling technique and a committee of classifiers. Then, classifiers trained in different feature spaces are combined to further improve classification accuracy. Numerical experiments on benchmark datasets show promising results, which confirm the effectiveness of the proposed method in terms of accuracy. The Matlab code and supplementary materials are available at http://eserver2.sat.iis.u-tokyo.ac.jp/~xmzhao/proteins.html.
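The one-vs-rest reduction and the idea of pairing sampling with a committee can be illustrated with a small sketch. The abstract does not specify the sampling technique or the base classifier, so everything below is an assumption made for illustration: random undersampling of the negatives stands in for the sampling step, a toy nearest-centroid learner stands in for the base classifier, and the names `one_vs_rest_labels`, `train_committee`, and `committee_score` are hypothetical. It should not be read as the authors' algorithm or as a port of their Matlab code.

```python
import random


def one_vs_rest_labels(labels, target):
    """Binary labels for one class: 1 for the target class, 0 for all others."""
    return [1 if label == target else 0 for label in labels]


def centroid(rows):
    """Component-wise mean of a non-empty list of feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class NearestCentroid:
    """Toy binary base learner (an assumption, not the paper's classifier)."""

    def fit(self, X, y):
        self.pos = centroid([x for x, t in zip(X, y) if t == 1])
        self.neg = centroid([x for x, t in zip(X, y) if t == 0])
        return self

    def predict(self, x):
        # Predict the class whose centroid is closer; ties go to positive.
        return 1 if sq_dist(x, self.pos) <= sq_dist(x, self.neg) else 0


def train_committee(X, y, n_members=5, seed=0):
    """Train each member on all positives plus a random, equally sized
    subset of negatives (plain undersampling, assumed for illustration).
    Assumes at least one positive and one negative example."""
    rng = random.Random(seed)
    pos = [x for x, t in zip(X, y) if t == 1]
    neg = [x for x, t in zip(X, y) if t == 0]
    k = min(len(pos), len(neg))
    members = []
    for _ in range(n_members):
        neg_sample = rng.sample(neg, k)
        X_bal = pos + neg_sample
        y_bal = [1] * len(pos) + [0] * k
        members.append(NearestCentroid().fit(X_bal, y_bal))
    return members


def committee_score(members, x):
    """Fraction of committee members that vote 'positive' for x."""
    return sum(m.predict(x) for m in members) / len(members)
```

To use the sketch for a full multi-class problem, one committee would be trained per class with `one_vs_rest_labels`, and a protein would be assigned to the class whose committee returns the highest score. Combining committees trained in different feature spaces, as the abstract describes, could follow the same voting pattern, though the abstract does not say how the combination is done.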