School of Computer Software, College of Intelligence and Computing, Tianjin University, Tianjin, China.
School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China.
Brief Bioinform. 2020 Mar 23;21(2):687-698. doi: 10.1093/bib/bbz021.
Recursive feature elimination (RFE), one of the most popular feature selection algorithms, has been applied extensively in bioinformatics. During training, a group of candidate subsets is generated by iteratively eliminating the least important features from the original feature set. However, how to determine the optimal subset among these candidates remains unclear. Most current studies use either overall accuracy or subset size (SS) to select the most predictive features. Whether to use one criterion or both, and how each affects prediction performance, are still open questions. In this study, we propose MinE-RFE, a novel RFE-based feature selection approach that takes the effect of both factors into account. The subset decision problem is mapped into the subset-accuracy space and cast as an energy-minimization problem. We also provide a mathematical description of the relationship between overall accuracy and SS using Gaussian mixture models together with spline fitting. In addition, we comprehensively review a variety of state-of-the-art applications of RFE in bioinformatics, and compare their approaches to choosing the final subset from the candidate subsets with MinE-RFE on diverse bioinformatics data sets. We also compare MinE-RFE with several widely used feature selection algorithms. The comparative results show that the proposed approach exhibits the best performance among all the approaches compared. To facilitate the use of MinE-RFE, we further established a user-friendly web server implementing the proposed approach, accessible at http://qgking.wicp.net/MinE/. We expect this web server to be a useful tool for the research community.
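The two steps described in the abstract, generating nested candidate subsets by RFE and then choosing one by trading accuracy against subset size, can be illustrated with a minimal sketch. The abstract does not state the energy function used by MinE-RFE, so the energy below, E = (1 - accuracy) + lam * size / n_total, is a hypothetical stand-in chosen only for illustration. The names `rfe_candidates`, `select_subset`, `lam`, and the toy importance and accuracy functions are likewise assumptions, not part of the published method.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def rfe_candidates(
    features: Sequence[str],
    rank: Callable[[List[str]], Dict[str, float]],
    step: int = 1,
) -> List[List[str]]:
    """Generic RFE loop: repeatedly drop the `step` least important
    features and record every intermediate subset as a candidate."""
    current = list(features)
    candidates = [list(current)]
    while len(current) > 1:
        scores = rank(current)  # importance of each remaining feature
        n_drop = min(step, len(current) - 1)  # always keep at least one
        drop = set(sorted(current, key=lambda f: scores[f])[:n_drop])
        current = [f for f in current if f not in drop]
        candidates.append(list(current))
    return candidates


def select_subset(
    candidates: List[List[str]],
    accuracy: Callable[[List[str]], float],
    n_total: int,
    lam: float = 0.1,
) -> Tuple[List[str], float]:
    """Return the candidate minimizing a HYPOTHETICAL energy
    E = (1 - accuracy) + lam * size / n_total.
    This is an illustrative trade-off, not the paper's formula."""
    def energy(subset: List[str]) -> float:
        return (1.0 - accuracy(subset)) + lam * len(subset) / n_total

    best = min(candidates, key=energy)
    return best, energy(best)


# Toy example: fixed importances and a made-up accuracy curve.
toy_weights = {"f1": 0.9, "f2": 0.5, "f3": 0.05, "f4": 0.01}
toy_gain = {"f1": 0.30, "f2": 0.15, "f3": 0.01, "f4": 0.0}


def toy_rank(feats: List[str]) -> Dict[str, float]:
    return {f: toy_weights[f] for f in feats}


def toy_accuracy(feats: List[str]) -> float:
    return 0.5 + sum(toy_gain[f] for f in feats)


cands = rfe_candidates(list(toy_weights), toy_rank)
best, e = select_subset(cands, toy_accuracy, n_total=len(toy_weights))
print(best)  # ['f1', 'f2']
```

In this toy setting the full feature set and the two-feature subset differ by only one percentage point of accuracy, so the size penalty tips the choice toward the smaller subset. In practice `rank` would come from a trained model's feature weights and `accuracy` from cross-validation, and the weight `lam` would have to be tuned or replaced by whatever formulation the method actually uses.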