Han Yuanyuan, Huang Lan, Zhou Fengfeng
College of Computer Science and Technology, and Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin 130012, China.
Bioinformatics. 2021 Aug 9;37(15):2183-2189. doi: 10.1093/bioinformatics/btab055.
A feature selection algorithm aims to select the subset of features most strongly associated with the class labels. Recursive feature elimination (RFE) is a heuristic feature screening framework that has been widely used to select biological omics biomarkers. This study proposed a dynamic recursive feature elimination (dRFE) framework with more flexible feature elimination operations. The proposed dRFE was comprehensively compared with 11 existing feature selection algorithms and five classifiers on eight difficult transcriptome datasets from a previous study, ten newly collected transcriptome datasets and five methylome datasets.
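For orientation, the regular RFE loop that dRFE builds on can be sketched in a few lines of Python. This is a minimal illustration of generic RFE, not the authors' dRFE implementation: the logistic-regression ranker, the `step` parameter, the helper names `fit_logistic` and `rfe`, and the toy data are all assumptions made for this sketch.

```python
import math


def fit_logistic(X, y, epochs=200, lr=0.1):
    """Fit a logistic regression by batch gradient descent.

    Returns one weight per feature. This model is only a stand-in
    ranker for the sketch; any model that yields per-feature
    importances could be used instead.
    """
    n, d = len(X), len(X[0])
    w = [0.0] * d
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * d
        grad_b = 0.0
        for row, label in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, row))
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - label
            for j in range(d):
                grad_w[j] += err * row[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w


def rfe(X, y, n_keep, step=1):
    """Regular RFE: refit, rank by |weight|, drop the weakest `step` features."""
    remaining = list(range(len(X[0])))  # original column indices still in play
    while len(remaining) > n_keep:
        X_sub = [[row[j] for j in remaining] for row in X]
        w = fit_logistic(X_sub, y)
        # Positions within `remaining`, ordered from least to most important.
        order = sorted(range(len(remaining)), key=lambda k: abs(w[k]))
        n_drop = min(step, len(remaining) - n_keep)
        drop = set(order[:n_drop])
        remaining = [f for k, f in enumerate(remaining) if k not in drop]
    return remaining


# Hypothetical toy data: column 0 separates the classes strongly,
# column 2 weakly, columns 1 and 3 are noise.
X = [
    [2.0, 0.1, 0.5, -0.2],
    [1.8, -0.3, 0.4, 0.1],
    [2.2, 0.2, 0.6, 0.0],
    [-2.0, 0.0, -0.5, 0.2],
    [-1.9, 0.3, -0.4, -0.1],
    [-2.1, -0.2, -0.6, 0.1],
]
y = [1, 1, 1, 0, 0, 0]

selected = rfe(X, y, n_keep=2)
print(selected)  # [0, 2]
```

In this sketch the number of features removed per round is fixed by `step`. The abstract states only that dRFE makes the elimination operations more flexible, so the sketch does not attempt to reproduce how dRFE does that.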
The experimental data suggested that the regular RFE framework did not perform well, and that dRFE outperformed the existing feature selection algorithms in most cases. The dRFE-detected features achieved an accuracy (Acc) of 1.0000 on the two methylome datasets GSE53045 and GSE66695. The best prediction accuracies of the dRFE-detected features were 0.9259, 0.9424 and 0.8601 on the other three methylome datasets GSE74845, GSE103186 and GSE80970, respectively. Four transcriptome datasets reached Acc = 1.0000 using the dRFE-detected features, and the prediction accuracies for the other six newly collected transcriptome datasets were between 0.6301 and 0.9917.
The experiments in this study were implemented and tested in Python 3.7.6.
Supplementary data are available at Bioinformatics online.