Drotár P, Gazda J, Smékal Z
Department of Telecommunications, Brno University of Technology, Technická 12, 61200 Brno, Czech Republic.
Department of Computers and Informatics, Technical University of Kosice, Letna 9, 0401 Kosice, Slovakia.
Comput Biol Med. 2015 Nov 1;66:1-10. doi: 10.1016/j.compbiomed.2015.08.010. Epub 2015 Aug 24.
Feature selection is a significant part of many machine learning applications dealing with small-sample and high-dimensional data. Choosing the most important features is an essential step for knowledge discovery in many areas of biomedical informatics. The increased popularity of feature selection methods and their frequent utilisation raise challenging new questions about the interpretability and stability of feature selection techniques. In this study, we compared the behaviour of ten state-of-the-art filter methods for feature selection in terms of their stability, similarity, and influence on prediction performance. All of the experiments were conducted on eight two-class datasets from biomedical areas. While entropy-based feature selection appears to be the most stable, the feature selection techniques yielding the highest prediction performance are the minimum redundancy maximum relevance (mRMR) method and feature selection based on the Bhattacharyya distance. In general, univariate feature selection techniques perform similarly to, or even better than, more complex multivariate feature selection techniques on high-dimensional datasets. However, on more complex and smaller datasets, multivariate methods slightly outperform univariate techniques.
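To make the univariate filter idea concrete, the sketch below scores each feature independently by the Bhattacharyya distance between the two class-conditional distributions (using the standard closed form for two univariate Gaussians) and ranks features by that score. This is an illustrative sketch under a Gaussian assumption, not the authors' exact implementation; the function names and the synthetic data are ours.

```python
import numpy as np

def bhattacharyya_distance(x0, x1, eps=1e-12):
    # Closed-form Bhattacharyya distance between two univariate
    # Gaussians fitted to the samples x0 and x1 (class 0 / class 1).
    m0, m1 = x0.mean(), x1.mean()
    v0, v1 = x0.var() + eps, x1.var() + eps  # eps guards zero variance
    return (0.25 * np.log(0.25 * (v0 / v1 + v1 / v0 + 2.0))
            + 0.25 * (m0 - m1) ** 2 / (v0 + v1))

def rank_features(X, y):
    # Univariate filter: score every feature in isolation, then sort
    # indices from most to least discriminative.
    scores = np.array([bhattacharyya_distance(X[y == 0, j], X[y == 1, j])
                       for j in range(X.shape[1])])
    return np.argsort(scores)[::-1], scores

# Small synthetic two-class example: only feature 2 is informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.array([0] * 100 + [1] * 100)
X[y == 1, 2] += 3.0  # shift class 1 along feature 2
order, scores = rank_features(X, y)
```

Because each feature is scored in isolation, the ranking ignores redundancy between features; that is exactly the gap multivariate criteria such as mRMR address by penalising features that are mutually redundant.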