Wang Changzhong, Hu Qinghua, Wang Xizhao, Chen Degang, Qian Yuhua, Dong Zhe
IEEE Trans Neural Netw Learn Syst. 2018 Jul;29(7):2986-2999. doi: 10.1109/TNNLS.2017.2710422. Epub 2017 Jun 23.
Feature selection is viewed as an important preprocessing step for pattern recognition, machine learning, and data mining. The neighborhood is one of the most important concepts in classification learning and can be used to distinguish samples with different decisions. In this paper, a neighborhood discrimination index is proposed to characterize the distinguishing information of a neighborhood relation; it reflects the distinguishing ability of a feature subset. The proposed index is computed from the cardinality of a neighborhood relation rather than from neighborhood similarity classes. Variants of the discrimination index, namely the joint, conditional, and mutual discrimination indices, are introduced to measure the change in distinguishing information caused by combining multiple feature subsets. These variants have properties similar to those of Shannon entropy and its variants. A parameter, the neighborhood radius, is introduced into these discrimination measures so that they can handle real-valued data. Based on the proposed measures, the significance of a candidate feature is defined and a greedy forward algorithm for feature selection is designed. Data sets selected from public data sources are used to compare the proposed algorithm with existing algorithms. The experimental results confirm that the discrimination index-based algorithm yields better performance than the other classical algorithms compared.
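The abstract gives no formulas, so the following is only a minimal sketch of how a cardinality-based discrimination index and a greedy forward search might fit together. It assumes that the index of a relation R over n samples takes the form log(n^2 / |R|), that the conditional index of the decision given a feature subset is log(|R_B| / |R_B ∩ R_D|), and that two samples are neighbors on a subset when they differ by at most the neighborhood radius `delta` on every feature in it. The function names, the stopping threshold `tol`, and the default radius are hypothetical and are not taken from the paper.

```python
import math
from itertools import product


def neighborhood_relation(X, feature, delta):
    """Pairs (i, j) whose values on one feature differ by at most delta."""
    n = len(X)
    return {(i, j) for i, j in product(range(n), repeat=2)
            if abs(X[i][feature] - X[j][feature]) <= delta}


def decision_relation(y):
    """Pairs (i, j) of samples that share the same decision label."""
    n = len(y)
    return {(i, j) for i, j in product(range(n), repeat=2) if y[i] == y[j]}


def discrimination_index(relation, n):
    """Assumed form log(n^2 / |R|): fewer related pairs, more discrimination."""
    return math.log(n * n / len(relation))


def conditional_index(r_cond, r_target):
    """Assumed form log(|R_cond| / |R_cond & R_target|).

    Both relations contain every pair (i, i), so the intersection is never empty.
    """
    return math.log(len(r_cond) / len(r_cond & r_target))


def greedy_forward_selection(X, y, delta=0.2, tol=1e-3):
    """Repeatedly add the feature that most reduces the conditional index
    of the decision given the selected subset; stop when the gain <= tol."""
    n, m = len(X), len(X[0])
    r_d = decision_relation(y)
    per_feature = [neighborhood_relation(X, a, delta) for a in range(m)]
    # Empty subset: every pair is related, so nothing is distinguished yet.
    r_b = set(product(range(n), repeat=2))
    current = conditional_index(r_b, r_d)
    selected, remaining = [], list(range(m))
    while remaining:
        # Under the per-feature radius, adding feature a intersects relations.
        scores = {a: conditional_index(r_b & per_feature[a], r_d)
                  for a in remaining}
        best = min(scores, key=scores.get)
        if current - scores[best] <= tol:  # significance of the best candidate
            break
        selected.append(best)
        remaining.remove(best)
        r_b &= per_feature[best]
        current = scores[best]
    return selected
```

For example, on four samples where the first feature separates the two classes and the second is constant, `greedy_forward_selection([[0.0, 0.5], [0.1, 0.5], [0.9, 0.5], [1.0, 0.5]], [0, 0, 1, 1])` selects only feature 0. Storing relations as explicit pair sets costs O(n^2) memory per feature, which keeps the sketch readable but would need a more compact representation for large data sets.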