Decision Variants for the Automatic Determination of Optimal Feature Subset in RF-RFE.

Authors

Chen Qi, Meng Zhaopeng, Liu Xinyi, Jin Qianguo, Su Ran

Affiliations

School of Computer Software, Tianjin University, Tianjin 300350, China.

The Military Transportation Command Department, Army Military Transportation University, Tianjin 300361, China.

Publication information

Genes (Basel). 2018 Jun 15;9(6):301. doi: 10.3390/genes9060301.

Abstract

Feature selection, which identifies the most informative features in the original feature space, has been widely used to simplify predictors. Recursive feature elimination (RFE), one of the most popular feature selection approaches, is effective at reducing data dimensionality and improving efficiency. RFE produces a ranking of features, as well as candidate subsets with their corresponding accuracy. The subset with the highest accuracy (HA) or the subset with a preset number of features (PreNum) is often used as the final subset. However, HA may lead to a large number of features being selected, and when there is no prior knowledge about the preset number, the PreNum choice is often ambiguous and subjective. A proper decision variant is therefore needed to determine the optimal subset automatically. In this study, we conduct pioneering work exploring the decision variant applied after a list of candidate subsets has been obtained from RFE. We provide a detailed analysis and comparison of several decision variants for automatically selecting the optimal feature subset. The random forest (RF)-recursive feature elimination (RF-RFE) algorithm and a voting strategy are introduced. We validated the variants on two very different molecular biology datasets, one from a toxicogenomic study and the other from protein sequence analysis. The study provides an automated way to determine the optimal feature subset when using RF-RFE.
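The two baseline decision rules named in the abstract, HA and PreNum, can be illustrated with a minimal sketch. It assumes RFE has already produced a list of candidate subsets, each with a size and an accuracy. The `Candidate` container, the function names, the tie-breaking rule, and the numbers are all hypothetical and chosen only for illustration; they are not taken from the paper, and the decision variants the paper itself compares are not described in the abstract and are not reproduced here.

```python
from typing import NamedTuple, Sequence


class Candidate(NamedTuple):
    """One candidate subset from an RFE run (hypothetical container)."""
    n_features: int   # number of features kept in this subset
    accuracy: float   # accuracy estimated for this subset


def select_ha(candidates: Sequence[Candidate]) -> Candidate:
    """HA: pick the candidate with the highest accuracy.

    Assumption: ties are broken in favor of the smaller subset.
    """
    return max(candidates, key=lambda c: (c.accuracy, -c.n_features))


def select_prenum(candidates: Sequence[Candidate], preset: int) -> Candidate:
    """PreNum: pick the candidate whose size equals a preset number."""
    for c in candidates:
        if c.n_features == preset:
            return c
    raise ValueError(f"no candidate subset with {preset} features")


# Made-up values for illustration only, not results from the study.
candidates = [
    Candidate(50, 0.90),
    Candidate(40, 0.91),
    Candidate(30, 0.91),
    Candidate(20, 0.89),
    Candidate(10, 0.85),
]

ha = select_ha(candidates)                    # accuracy-driven choice
pre = select_prenum(candidates, preset=10)    # size-driven choice
print(ha, pre)
```

With these made-up numbers, HA keeps 30 features, while PreNum with a preset of 10 gives up accuracy for a smaller subset. This is the trade-off that motivates an automatic decision rule.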


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ad60/6027449/0ea468f0d911/genes-09-00301-g001.jpg
