Xin Zekun, Lv Ruhong, Liu Wei, Wang Shenghan, Gao Qiang, Zhang Bao, Sun Guangyu
Department of Urology, Aerospace Center Hospital, Beijing, China.
School of Computer and Information Technology, Beijing Jiaotong University, Beijing, China.
PeerJ Comput Sci. 2024 Jan 4;10:e1768. doi: 10.7717/peerj-cs.1768. eCollection 2024.
Feature selection plays a crucial role in classification tasks as part of data preprocessing. Effective feature selection can improve the robustness and interpretability of learning algorithms and accelerate model training. However, traditional statistical methods for feature selection are no longer practical for high-dimensional data because of their computational complexity. Ensemble learning, a prominent approach in machine learning, has demonstrated exceptional performance, particularly in classification problems. To address this issue, we propose a three-stage feature selection framework for high-dimensional data based on ensemble learning (EFS-GINI). First, highly linearly correlated features are eliminated using the Spearman coefficient. Then, in the first stage, a feature selector based on the F-test is applied. In the second stage, four feature subsets are formed in parallel using mutual information (MI), ReliefF, SURF, and SURF* filters. In the third stage, features are selected using a combinator based on the Gini coefficient. Finally, a soft voting approach is used for classification, combining decision tree, naive Bayes, support vector machine (SVM), k-nearest neighbors (KNN), and random forest classifiers. To demonstrate the effectiveness and efficiency of the proposed algorithm, we evaluate it on eight high-dimensional datasets and compare it with five other feature selection methods. Experimental results show that our method effectively enhances the accuracy and speed of feature selection. Moreover, to explore the biological significance of the proposed algorithm, we apply it to the renal cell carcinoma dataset GSE40435 from the Gene Expression Omnibus database. Two feature genes, NOP2 and NSUN5, are selected by our proposed algorithm. Both are directly involved in regulating m5C RNA modification, which supports the biological relevance of EFS-GINI. Through bioinformatics analysis, we show that m5C-related genes play an important role in the occurrence and progression of renal cell carcinoma and are expected to become important markers for predicting patient prognosis.
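As a rough illustration of three of the steps named above (Spearman-based redundancy removal, F-test scoring, and soft voting), the following minimal sketch uses only the Python standard library. It is not the authors' implementation: the function names, the 0.9 correlation threshold, the greedy keep-first rule, and the toy data are all assumptions made for illustration. The MI, ReliefF, SURF, and SURF* filters and the Gini-based combinator are not reproduced, since the abstract does not specify them.

```python
import math
from statistics import mean


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            result[order[pos]] = shared
        start = end + 1
    return result


def spearman(x, y):
    """Spearman coefficient: Pearson correlation of the ranks (non-constant inputs)."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / math.sqrt(sxx * syy)


def drop_redundant(columns, threshold=0.9):
    """Greedily keep a feature only if |rho| <= threshold against every kept one.

    The threshold and the keep-first rule are assumptions, not taken from the paper.
    """
    kept = []
    for j, col in enumerate(columns):
        if all(abs(spearman(col, columns[k])) <= threshold for k in kept):
            kept.append(j)
    return kept


def anova_f(feature, labels):
    """One-way ANOVA F statistic of one feature across the class labels."""
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(label, []).append(value)
    n, k = len(feature), len(groups)
    grand = mean(feature)
    means = {label: mean(g) for label, g in groups.items()}
    between = sum(len(g) * (means[label] - grand) ** 2 for label, g in groups.items())
    within = sum((v - means[label]) ** 2 for label, g in groups.items() for v in g)
    return (between / (k - 1)) / (within / (n - k))


def soft_vote(probabilities):
    """Average class probabilities over classifiers; return (argmax class, averages)."""
    n_classes = len(probabilities[0])
    avg = [mean([p[c] for p in probabilities]) for c in range(n_classes)]
    return max(range(n_classes), key=lambda c: avg[c]), avg


# Toy data (invented): six samples, three features stored column-wise.
labels = [0, 0, 0, 1, 1, 1]
columns = [
    [1, 2, 3, 7, 8, 9],      # separates the two classes
    [2, 4, 6, 14, 16, 18],   # monotone copy of the first column
    [5, 1, 4, 2, 6, 3],      # unrelated to the labels
]
kept = drop_redundant(columns)
scores = {j: anova_f(columns[j], labels) for j in kept}
best = max(scores, key=scores.get)

# Three hypothetical classifiers' class probabilities for one sample.
label, avg = soft_vote([[0.9, 0.1], [0.4, 0.6], [0.45, 0.55]])
print(kept, best, label)  # prints: [0, 2] 0 0
```

In the soft-voting example, two of the three classifiers individually favor class 1, yet the averaged probabilities favor class 0. That is the difference between soft voting and majority (hard) voting. In practice one would more likely rely on library routines such as `scipy.stats.spearmanr`, `sklearn.feature_selection.f_classif`, and `sklearn.ensemble.VotingClassifier(voting="soft")` than on hand-written versions like these.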