Ding Junyao, Du Jianchao, Wang Hejie, Xiao Song
School of Telecommunications Engineering, Xidian University, Xi'an, 710071, China.
Beijing Electronic Science and Technology Institute, Beijing, 100070, China.
Sci Rep. 2025 May 14;15(1):16828. doi: 10.1038/s41598-025-01761-1.
Data acquisition methods are becoming increasingly diverse and advanced, leading to higher data dimensionality, blurred classification boundaries, and dataset overfitting, all of which degrade the accuracy of machine learning models. Many studies have sought to improve model performance through feature selection. However, any single feature selection method suffers from shortcomings such as incompleteness, instability, or high time cost; combining the strengths of several methods can overcome these defects. This paper proposes a two-stage feature selection method based on random forest and an improved genetic algorithm. First, random forest importance scores are computed and ranked, and features are preliminarily eliminated according to these scores, reducing the time complexity of the subsequent stage. Then, the improved genetic algorithm searches for the globally optimal feature subset. This stage introduces a multi-objective fitness function that guides the search to minimize the number of selected features while improving classification accuracy. An adaptive mechanism and an evolution strategy are also added to mitigate the loss of population diversity and the degeneration that occur in later iterations, thereby improving search efficiency. Experimental results on eight UCI datasets show that the proposed method significantly improves classification performance and has excellent feature selection capability.
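The following is a minimal sketch of the two-stage pipeline described in the abstract, assuming scikit-learn's RandomForestClassifier for the importance-based pre-filter and a simple hand-rolled genetic algorithm for the second stage. The dataset, the weight `alpha`, `keep_ratio`, the population size, and the adaptive mutation schedule are illustrative assumptions, not the authors' exact settings.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X, y = load_breast_cancer(return_X_y=True)   # stand-in dataset, not from the paper

# Stage 1: rank features by random forest importance and keep the top fraction.
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
keep_ratio = 0.6                              # assumed pre-filter ratio
order = np.argsort(rf.feature_importances_)[::-1]
kept = order[: int(keep_ratio * X.shape[1])]
X_kept = X[:, kept]

# Stage 2: genetic algorithm over binary masks of the remaining features.
def fitness(mask):
    """Multi-objective fitness: reward accuracy, penalize subset size."""
    if mask.sum() == 0:
        return 0.0
    acc = cross_val_score(
        RandomForestClassifier(n_estimators=50, random_state=0),
        X_kept[:, mask.astype(bool)], y, cv=3).mean()
    alpha = 0.9                               # assumed weight on accuracy
    return alpha * acc + (1 - alpha) * (1 - mask.sum() / mask.size)

def select(pop, scores):
    """Binary tournament selection: the fitter of two random individuals."""
    i, j = rng.choice(len(pop), 2, replace=False)
    return pop[i] if scores[i] >= scores[j] else pop[j]

n_pop, n_gen, n_feat = 20, 15, X_kept.shape[1]
pop = rng.integers(0, 2, size=(n_pop, n_feat))

for gen in range(n_gen):
    scores = np.array([fitness(ind) for ind in pop])
    # Elitism: carry the best individual into the next generation unchanged.
    new_pop = [pop[scores.argmax()].copy()]
    # Adaptive mutation rate decays over generations (assumed schedule).
    p_mut = 0.02 + 0.08 * (1 - gen / n_gen)
    while len(new_pop) < n_pop:
        a, b = select(pop, scores), select(pop, scores)
        cut = rng.integers(1, n_feat)         # single-point crossover
        child = np.concatenate([a[:cut], b[cut:]])
        flip = rng.random(n_feat) < p_mut     # bit-flip mutation
        child[flip] ^= 1
        new_pop.append(child)
    pop = np.array(new_pop)

best = pop[np.argmax([fitness(ind) for ind in pop])]
print("selected feature indices:", kept[best.astype(bool)])
```

The two-stage structure keeps the GA's search space small: the random forest filter discards clearly unimportant features cheaply, and only the surviving candidates are subjected to the more expensive wrapper-style evolutionary search.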