Yao Dengju, Yang Jing, Zhan Xiaojuan, Zhan Xiaorong, Xie Zhiqiang
Int J Data Min Bioinform. 2015;13(1):84-101. doi: 10.1504/ijdmb.2015.070852.
High-dimensional data and the large number of redundant features in bioinformatics research have created an urgent need for feature selection. This paper proposes a novel random forest-based feature selection method that stratifies the feature space and combines generalised sequential backward search with generalised sequential forward search. Random forest variable importance scores are used to rank the features, and different classifiers serve as the feature subset evaluation function. The method is evaluated on five microarray expression datasets (leukaemia, prostate, breast, nervous and DLBCL), on which the SVM classifier achieves average accuracies of 100%, 95.24%, 85%, 91.67% and 91.67%, respectively. The results show that the proposed method not only improves classification accuracy but also greatly reduces the computation time of the feature selection process.
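The abstract does not spell out the algorithm, so the following is only a minimal sketch of how the named ingredients could fit together, not the authors' implementation. It assumes that importance scores are already available (in the paper they come from a random forest), that the ranked features are split into contiguous strata, that backward search prunes the top stratum, and that forward search then tries blocks from the remaining strata. All function names, the `step` parameter, the toy scores and the toy evaluator are hypothetical; in practice the evaluator would be something like the cross-validated accuracy of an SVM.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical evaluator type: maps a feature subset to a score (higher is better).
Evaluator = Callable[[Sequence[str]], float]


def rank_features(importance: Dict[str, float]) -> List[str]:
    """Order feature names by descending importance score."""
    return sorted(importance, key=importance.get, reverse=True)


def stratify(ranked: List[str], n_strata: int) -> List[List[str]]:
    """Split a ranked list into contiguous strata of near-equal size."""
    size, extra = divmod(len(ranked), n_strata)
    strata, start = [], 0
    for i in range(n_strata):
        end = start + size + (1 if i < extra else 0)
        strata.append(ranked[start:end])
        start = end
    return strata


def backward_search(features: List[str], evaluate: Evaluator,
                    step: int) -> Tuple[List[str], float]:
    """Drop the `step` lowest-ranked features at a time while the score does not fall."""
    current, best = list(features), evaluate(features)
    while len(current) > step:
        candidate = current[:-step]
        score = evaluate(candidate)
        if score < best:
            break
        current, best = candidate, score
    return current, best


def forward_search(selected: List[str], pool: List[str], evaluate: Evaluator,
                   step: int) -> Tuple[List[str], float]:
    """Try adding blocks of `step` features from the pool; keep a block only if it helps."""
    current, best = list(selected), evaluate(selected)
    for i in range(0, len(pool), step):
        candidate = current + pool[i:i + step]
        score = evaluate(candidate)
        if score > best:
            current, best = candidate, score
    return current, best


def select(importance: Dict[str, float], evaluate: Evaluator,
           n_strata: int = 2, step: int = 1) -> Tuple[List[str], float]:
    """Assumed combination: prune the top stratum, then grow from the lower strata."""
    strata = stratify(rank_features(importance), n_strata)
    kept, _ = backward_search(strata[0], evaluate, step)
    pool = [f for stratum in strata[1:] for f in stratum]
    return forward_search(kept, pool, evaluate, step)


# Toy data (made up for illustration): only g1 and g4 carry signal.
TOY_IMPORTANCE = {"g1": 0.5, "g2": 0.2, "g3": 0.15, "g4": 0.1, "g5": 0.05, "g6": 0.0}
INFORMATIVE = {"g1", "g4"}


def toy_evaluate(subset: Sequence[str]) -> float:
    """Stand-in for classifier accuracy: reward informative features, penalise size."""
    return len(INFORMATIVE & set(subset)) - 0.01 * len(subset)


if __name__ == "__main__":
    print(select(TOY_IMPORTANCE, toy_evaluate))
```

Evaluating only ranked blocks rather than every candidate subset is one plausible reason such a scheme would need fewer classifier fits than an exhaustive wrapper search, which is consistent with the reduced computation time the abstract reports.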