Ma Li, Fan Suohai
School of Information Science and Technology, Jinan University, Guangzhou, 510632, China.
BMC Bioinformatics. 2017 Mar 14;18(1):169. doi: 10.1186/s12859-017-1578-z.
The random forests algorithm is a highly versatile classifier with a wide range of applications and robustness against overfitting. However, random forests still has some drawbacks. Therefore, to improve its performance, this paper addresses imbalanced data processing, feature selection, and parameter optimization.
We propose the CURE-SMOTE algorithm for the imbalanced data classification problem. Experiments on imbalanced UCI data show that combining Clustering Using Representatives (CURE) with the original synthetic minority oversampling technique (SMOTE) is effective compared with classification on the original data and with random sampling, Borderline-SMOTE1, safe-level SMOTE, C-SMOTE, and k-means-SMOTE. Additionally, we propose a hybrid random forests (RF) algorithm for feature selection and parameter optimization, which uses the minimum out-of-bag (OOB) error as its objective function. Simulation results on binary and higher-dimensional data indicate that the proposed hybrid RF algorithms, namely the hybrid genetic-random forests, hybrid particle swarm-random forests, and hybrid fish swarm-random forests algorithms, achieve the minimum OOB error and show the best generalization ability.
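To make the idea concrete, the sketch below illustrates the two ingredients named above: CURE-style representative points used to filter likely noise from the minority class, followed by SMOTE-style interpolation to synthesize new minority samples. This is an illustrative sketch only, not the authors' implementation; the function names and parameters (`n_reps`, `shrink`, `keep_frac`) are our own assumptions.

```python
import numpy as np

def cure_filter(minority, n_reps=5, shrink=0.5, keep_frac=0.9):
    """CURE-flavoured noise filter (illustrative): pick well-scattered
    representative points by farthest-point sampling, shrink them toward
    the centroid (as CURE shrinks its representatives), and keep only the
    samples close to some representative, dropping isolated points as noise."""
    centroid = minority.mean(axis=0)
    reps = [minority[0]]
    for _ in range(n_reps - 1):
        # next representative: the sample farthest from all current ones
        d = np.min([np.linalg.norm(minority - r, axis=1) for r in reps], axis=0)
        reps.append(minority[np.argmax(d)])
    reps = centroid + shrink * (np.array(reps) - centroid)  # shrink toward centroid
    # distance of each sample to its nearest (shrunken) representative
    d = np.min([np.linalg.norm(minority - r, axis=1) for r in reps], axis=0)
    keep = d <= np.quantile(d, keep_frac)  # discard the most isolated fraction
    return minority[keep]

def smote_interpolate(minority, n_new, k=3, rng=None):
    """SMOTE-style oversampling: each synthetic point lies on the segment
    between a random minority sample and one of its k nearest
    minority-class neighbours."""
    rng = np.random.default_rng(rng)
    out = []
    for _ in range(n_new):
        x = minority[rng.integers(len(minority))]
        d = np.linalg.norm(minority - x, axis=1)
        nn = np.argsort(d)[1:k + 1]          # k nearest neighbours, excluding x itself
        j = rng.choice(nn)
        out.append(x + rng.random() * (minority[j] - x))
    return np.array(out)
```

A typical use would filter the minority class first and then oversample the cleaned set, so the synthetic samples are interpolated between non-noisy points and stay close to the original distribution:

```python
rng = np.random.default_rng(0)
minority = rng.normal(size=(20, 2))
clean = cure_filter(minority)
synthetic = smote_interpolate(clean, n_new=30, rng=0)
```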
The training set produced by the proposed CURE-SMOTE algorithm is closer to the original data distribution because it contains minimal noise; thus, this feasible and effective algorithm produces better classification results. Moreover, the hybrid algorithms' F-value, G-mean, AUC and OOB scores surpass those of the original RF algorithm. Hence, these hybrid algorithms provide a new way to perform feature selection and parameter optimization.
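The hybrid schemes above share one pattern: a metaheuristic searches over a feature subset and RF parameters, scoring each candidate by OOB error. The sketch below shows this pattern with a toy genetic algorithm, the simplest of the three metaheuristics mentioned. It is a minimal sketch under our own assumptions, not the paper's algorithm: the fitness function is pluggable (in the real hybrid it would be the OOB error of a forest trained with the encoded feature mask and parameter), and all operator settings are illustrative.

```python
import numpy as np

def ga_optimize(fitness, n_feat, param_bounds, pop=20, gens=30, rng=0):
    """Toy GA in the spirit of hybrid GA-RF: each individual encodes a
    binary feature mask plus one real parameter (a stand-in for an RF
    hyperparameter such as mtry). Lower fitness (e.g. OOB error) is better."""
    rng = np.random.default_rng(rng)
    lo, hi = param_bounds
    masks = rng.integers(0, 2, size=(pop, n_feat))
    params = rng.uniform(lo, hi, size=pop)
    for _ in range(gens):
        scores = np.array([fitness(m, p) for m, p in zip(masks, params)])
        order = np.argsort(scores)               # sort so the fittest come first
        masks, params = masks[order], params[order]
        half = pop // 2                          # elitism: keep the top half
        for i in range(half, pop):               # refill via crossover + mutation
            a, b = rng.integers(0, half, size=2)
            cut = rng.integers(1, n_feat)        # one-point crossover on the mask
            masks[i] = np.concatenate([masks[a][:cut], masks[b][cut:]])
            params[i] = 0.5 * (params[a] + params[b])
            if rng.random() < 0.2:               # occasional mutation
                masks[i, rng.integers(n_feat)] ^= 1
                params[i] = np.clip(params[i] + rng.normal(0, 0.1 * (hi - lo)), lo, hi)
    scores = np.array([fitness(m, p) for m, p in zip(masks, params)])
    best = np.argmin(scores)
    return masks[best], params[best], scores[best]
```

With a real forest, `fitness` would train on the columns selected by the mask with the encoded parameter value and return the resulting OOB error; particle swarm or fish swarm variants would replace only the search loop, keeping the same OOB-based objective.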