National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.
Anal Chim Acta. 2014 Jan 2;806:117-27. doi: 10.1016/j.aca.2013.10.050. Epub 2013 Nov 6.
Imbalanced datasets are commonly generated by high-throughput screening (HTS). When the imbalance is not taken into account, most classification methods achieve high predictive accuracy for the majority class but markedly poorer performance for the minority class. In this work, an efficient algorithm, GLMBoost, is coupled with the Synthetic Minority Over-sampling TEchnique (SMOTE) and applied to several imbalanced datasets from PubChem BioAssay. With the proposed combined method, the rare samples (active compounds), which are usually classified poorly, can be detected with high balanced accuracy (Gmean). For comparison, Random Forest (RF) combined with SMOTE is also used to classify the same datasets. Our results show that GLMBoost+SMOTE not only performs better, as measured by the percentage of correctly classified rare samples (Sensitivity) and by Gmean, but is also more computationally efficient than RF+SMOTE. We therefore hope that the proposed combination of GLMBoost and SMOTE can be widely used to tackle imbalanced classification problems.
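The two ideas the abstract relies on, scoring by Sensitivity and Gmean rather than plain accuracy, and over-sampling the minority class with SMOTE, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `sensitivity_specificity_gmean` and `smote_sketch` are hypothetical, Gmean is assumed to be the geometric mean of sensitivity and specificity (its usual definition in the imbalanced-learning literature), and the SMOTE routine is a simplified version that interpolates between a minority sample and one of its k nearest minority neighbours.

```python
import math
import random


def sensitivity_specificity_gmean(y_true, y_pred, positive=1):
    """Return (sensitivity, specificity, gmean) for binary labels.

    Assumption: Gmean = sqrt(sensitivity * specificity).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    sens = tp / (tp + fn) if (tp + fn) else 0.0
    spec = tn / (tn + fp) if (tn + fp) else 0.0
    return sens, spec, math.sqrt(sens * spec)


def smote_sketch(minority, n_synthetic, k=5, seed=0):
    """Generate synthetic minority samples by linear interpolation.

    Simplified illustration of SMOTE: pick a minority sample, pick one of
    its k nearest minority neighbours (squared Euclidean distance), and
    place a new point at a random position on the segment between them.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        x = minority[i]
        # Rank the other minority samples by distance to x.
        others = [m for j, m in enumerate(minority) if j != i]
        others.sort(key=lambda m: sum((a - b) ** 2 for a, b in zip(x, m)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(x, neighbour)])
    return synthetic
```

As a usage illustration, a classifier that labels all 100 compounds of a 5-active/95-inactive set as inactive is 95% accurate, yet `sensitivity_specificity_gmean([1] * 5 + [0] * 95, [0] * 100)` gives a sensitivity and Gmean of zero, which is why the abstract reports those measures. In practice the synthetic samples from `smote_sketch` would be appended to the training set only, never to the test set.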