Radivojac Predrag, Chawla Nitesh V, Dunker A Keith, Obradovic Zoran
Center for Information Science and Technology, Temple University, USA.
J Biomed Inform. 2004 Aug;37(4):224-39. doi: 10.1016/j.jbi.2004.07.008.
We consider the problem of classification in noisy, high-dimensional, and class-imbalanced protein datasets. To design a complete classification system, we use a three-stage machine learning framework consisting of a feature selection stage, a method addressing noise and class imbalance, and a method for combining biologically related tasks through prior-knowledge-based clustering. In the first stage, we employ Fisher's permutation test as a feature selection filter. Comparisons with alternative criteria show that it may be favorable for typical protein datasets. In the second stage, noise and class imbalance are addressed by minority-class over-sampling, majority-class under-sampling, and ensemble learning. The performance of logistic regression models, decision trees, and neural networks is systematically evaluated. The experimental results show that in many cases ensembles of logistic regression classifiers may outperform more expressive models because of their robustness to noise and to the low sample density of a high-dimensional feature space. However, ensembles of neural networks may be the best solution for large datasets. In the third stage, we use prior knowledge to partition unlabeled data so that the class distributions among non-overlapping clusters differ significantly. In our experiments, training classifiers specialized to the class distribution of each cluster further reduced the classification error.
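The first stage, a permutation-test feature filter, can be sketched as follows. This is not the authors' code: the statistic (absolute difference in class means), the number of permutations, and the `alpha` threshold are illustrative assumptions, standing in for whatever test statistic and significance level the paper actually uses.

```python
import random

def permutation_pvalue(values, labels, n_perm=1000, seed=0):
    """Permutation p-value for one feature: how often does shuffling the
    class labels produce at least as large an absolute difference in
    class means as the one observed?"""
    rng = random.Random(seed)

    def mean_diff(lab):
        pos = [v for v, l in zip(values, lab) if l == 1]
        neg = [v for v, l in zip(values, lab) if l == 0]
        return abs(sum(pos) / len(pos) - sum(neg) / len(neg))

    observed = mean_diff(labels)
    count = 0
    for _ in range(n_perm):
        shuffled = labels[:]
        rng.shuffle(shuffled)  # shuffling preserves the class proportions
        count += mean_diff(shuffled) >= observed
    # add-one smoothing keeps the p-value strictly positive
    return (count + 1) / (n_perm + 1)

def select_features(X, y, alpha=0.05):
    """Filter: keep the column indices whose permutation p-value is below
    alpha (the threshold is a hypothetical choice, not from the paper)."""
    cols = list(zip(*X))  # transpose the sample matrix into feature columns
    return [j for j, col in enumerate(cols)
            if permutation_pvalue(list(col), y) < alpha]
```

Because the test is distribution-free, it makes no normality assumption about the feature values, which is one reason such filters are attractive for small, noisy protein datasets.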
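The second stage combines resampling with ensemble learning. A minimal sketch, assuming a balanced bootstrap design: each ensemble member is trained on an equal-sized sample from both classes (over-sampling the minority, under-sampling the majority), and predictions are combined by majority vote. The tiny gradient-descent logistic regression below stands in for the paper's classifiers; the per-class sample size and learning parameters are illustrative, not taken from the paper.

```python
import math
import random

def train_logreg(X, y, lr=0.1, epochs=200):
    """Fit logistic-regression weights (bias stored last) by batch
    gradient descent on the log-loss."""
    d = len(X[0])
    w = [0.0] * (d + 1)
    for _ in range(epochs):
        grad = [0.0] * (d + 1)
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi + [1.0]))
            p = 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, z))))  # clipped sigmoid
            for j, xj in enumerate(xi + [1.0]):
                grad[j] += (p - yi) * xj
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad)]
    return w

def predict(w, x):
    """Classify by the sign of the linear score."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x + [1.0])) > 0 else 0

def balanced_ensemble(X, y, n_models=11, seed=0):
    """Train each member on a class-balanced resample: the minority class
    is over-sampled with replacement, the majority class under-sampled
    (also drawn with replacement here, for simplicity)."""
    rng = random.Random(seed)
    pos = [x for x, label in zip(X, y) if label == 1]
    neg = [x for x, label in zip(X, y) if label == 0]
    n = len(X) // 2  # hypothetical per-class sample size, not from the paper
    models = []
    for _ in range(n_models):
        Xs = [rng.choice(pos) for _ in range(n)] + [rng.choice(neg) for _ in range(n)]
        ys = [1] * n + [0] * n
        models.append(train_logreg(Xs, ys))
    return models

def vote(models, x):
    """Combine ensemble members by majority vote."""
    return 1 if 2 * sum(predict(w, x) for w in models) > len(models) else 0
```

Averaging over many balanced resamples is what lends the ensemble its robustness: no single noisy or mislabeled example dominates, and the minority class carries equal weight in every member's training set.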
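The third stage, training specialized classifiers per cluster, reduces to a simple dispatch pattern. In this hypothetical sketch the cluster assignment is assumed to be supplied externally (encoding the biological prior), and `train_fn`/`predict_fn` are placeholders for any base learner, such as the ensembles above.

```python
def train_per_cluster(X, y, cluster_ids, train_fn):
    """Group the training data by prior-knowledge cluster and fit one
    model per cluster, so each model sees its cluster's class distribution."""
    groups = {}
    for x, label, c in zip(X, y, cluster_ids):
        groups.setdefault(c, ([], []))
        groups[c][0].append(x)
        groups[c][1].append(label)
    return {c: train_fn(Xs, ys) for c, (Xs, ys) in groups.items()}

def predict_per_cluster(models, x, c, predict_fn):
    """Route a new example to the classifier of its cluster."""
    return predict_fn(models[c], x)
```

The benefit comes precisely from the condition stated in the abstract: when class distributions differ significantly across clusters, a per-cluster model can use a decision threshold matched to its local class prior instead of the global one.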