Interdisciplinary Center for Biotechnology Research, University of Florida, PO Box 103622, Gainesville, FL 32610-3622, USA.
IEEE Trans Pattern Anal Mach Intell. 2010 Sep;32(9):1610-26. doi: 10.1109/TPAMI.2009.190.
This paper considers feature selection for data classification in the presence of a huge number of irrelevant features. We propose a new feature-selection algorithm that addresses several major issues with prior work, including problems with algorithm implementation, computational complexity, and solution accuracy. The key idea is to decompose an arbitrarily complex nonlinear problem into a set of locally linear ones through local learning, and then to learn feature relevance globally within the large-margin framework. The proposed algorithm is based on well-established machine learning and numerical analysis techniques, and makes no assumptions about the underlying data distribution. It is capable of processing many thousands of features within minutes on a personal computer while maintaining a very high accuracy that is nearly insensitive to a growing number of irrelevant features. A theoretical analysis suggests that the algorithm's sample complexity grows logarithmically with the number of features. Experiments on 11 synthetic and real-world data sets demonstrate the viability of our formulation of the feature-selection problem for supervised learning and the effectiveness of our algorithm.
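The abstract describes the method only at a high level. As a rough, non-authoritative illustration of the stated idea (locally linearizing the problem via soft nearest hits and misses, then learning feature relevance globally with a margin-based loss), the Python sketch below fits nonnegative feature weights by minimizing a logistic large-margin loss with an l1 penalty. The function name `logo_feature_weights`, the soft-neighbor kernel, and all parameter defaults are assumptions made for illustration, not the authors' published implementation.

```python
import numpy as np

def logo_feature_weights(X, y, n_iter=30, lr=0.05, lam=1.0, sigma=1.0):
    """Simplified local-learning feature weighting (illustrative sketch).

    X: (n, d) data matrix; y: (n,) class labels, each class with >= 2 samples.
    Returns a nonnegative weight per feature; larger means more relevant.
    lr, lam, sigma are hypothetical defaults, not values from the paper.
    """
    n, d = X.shape
    w = np.ones(d)
    # Elementwise absolute differences |x_i - x_j|; O(n^2 d) memory, demo-scale only.
    diffs = np.abs(X[:, None, :] - X[None, :, :])          # (n, n, d)
    same = (y[:, None] == y[None, :]) & ~np.eye(n, dtype=bool)
    other = y[:, None] != y[None, :]
    for _ in range(n_iter):
        dist = diffs @ w                                   # weighted L1 distances (n, n)
        K = np.exp(-dist / sigma)                          # soft-nearest-neighbor kernel
        z = np.zeros((n, d))
        for i in range(n):
            ph = K[i] * same[i]                            # soft nearest-hit weights
            pm = K[i] * other[i]                           # soft nearest-miss weights
            ph /= ph.sum()
            pm /= pm.sum()
            # Expected margin vector: per-feature miss distance minus hit distance.
            z[i] = pm @ diffs[i] - ph @ diffs[i]
        # Gradient of (1/n) sum_i log(1 + exp(-w^T z_i)) + lam * ||w||_1, w >= 0.
        m = np.clip(z @ w, -60.0, 60.0)                    # avoid overflow in exp
        s = 1.0 / (1.0 + np.exp(m))                        # sigmoid(-w^T z_i)
        grad = -(s[:, None] * z).sum(axis=0) / n + lam
        w = np.maximum(w - lr * grad, 0.0)                 # projected gradient step
    return w

if __name__ == "__main__":
    # Toy check: 2 informative features among 50 irrelevant ones.
    rng = np.random.default_rng(0)
    informative = rng.normal(size=(100, 2))
    y = (informative.sum(axis=1) > 0).astype(int)
    X = np.hstack([informative, rng.normal(size=(100, 50))])
    w = logo_feature_weights(X, y)
    print(np.argsort(w)[::-1][:5])  # features 0 and 1 should tend to rank highest
```

The l1 penalty mirrors the abstract's emphasis on robustness to many irrelevant features: it shrinks the weights of features that do not consistently enlarge the local margins toward zero.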