Pérez Noel Pérez, Guevara López Miguel A, Silva Augusto, Ramos Isabel
Institute of Mechanical Engineering and Industrial Management (INEGI), Campus da FEUP, Rua Dr. Roberto Frias, 400, 4200-465 Porto, Portugal.
Institute of Electronics and Telematics Engineering of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal; Institute of Mechanical Engineering and Industrial Management (INEGI), Campus da FEUP, Rua Dr. Roberto Frias, 400, 4200-465 Porto, Portugal.
Artif Intell Med. 2015 Jan;63(1):19-31. doi: 10.1016/j.artmed.2014.12.004. Epub 2014 Dec 12.
This work presents the theoretical description and experimental evaluation of a new feature selection method, named uFilter. uFilter improves on the Mann-Whitney U-test for reducing dimensionality and ranking features in binary classification problems. The work also presents a practical application of uFilter to breast cancer computer-aided diagnosis (CADx).
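To make the starting point concrete, the following is a minimal sketch of ranking features with the plain Mann-Whitney U statistic in a binary problem. It illustrates only the baseline U-test idea, not uFilter itself, whose specific improvements are not described in this abstract. The function names, the 0/1 label encoding, and the normalised separation score are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def mann_whitney_u(pos: Sequence[float], neg: Sequence[float]) -> float:
    """U statistic of `pos` versus `neg`.

    Counts the pairs (x, y) with x from `pos` and y from `neg` where x > y;
    tied pairs contribute 0.5.
    """
    u = 0.0
    for x in pos:
        for y in neg:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u


def rank_features_by_u(
    X: Sequence[Sequence[float]], y: Sequence[int]
) -> List[Tuple[int, float]]:
    """Rank feature columns by how well they separate the two classes.

    Assumption (not from the paper): labels are 0/1 and the score is
    |U / (n_pos * n_neg) - 0.5| * 2, which lies in [0, 1]. A score of 1
    means the two classes do not overlap on that feature; 0 means the
    feature carries no rank information about the class.
    """
    n_features = len(X[0])
    scores = []
    for j in range(n_features):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        u = mann_whitney_u(pos, neg)
        separation = abs(u / (len(pos) * len(neg)) - 0.5) * 2.0
        scores.append((j, separation))
    # Highest separation first; ties keep their original column order.
    return sorted(scores, key=lambda item: item[1], reverse=True)
```

The pairwise count is quadratic in the number of samples and is used here only for readability; a rank-sum formulation gives the same statistic more efficiently on larger datasets.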
A total of 720 datasets (ranked subsets of features) were formed by applying chi-square (CHI2) discretization, information gain (IG), one-rule (1Rule), Relief, uFilter, and the method on which uFilter is theoretically based (the U-test). Each dataset was used to train feed-forward backpropagation neural network, support vector machine, linear discriminant analysis, and naive Bayes classifiers, producing classification scores for subsequent statistical comparison.
A head-to-head comparison against the U-test method, based on the mean area under the receiver operating characteristic curve (AUC), showed that uFilter significantly outperformed the U-test for almost all classification schemes (p<0.05): it was superior in 50%, tied in 37.5%, and lost in 12.5% of the 24 comparative scenarios. Compared with CHI2 discretization, IG, 1Rule, and Relief, uFilter was superior or at least statistically similar on the explored datasets while requiring fewer features.
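The win/tie/loss percentages above can be reproduced from per-scenario results with a simple tally. The sketch below is hypothetical: the abstract does not state which statistical test was used or how ties were defined, so it assumes each scenario supplies two mean scores for the area under the receiver operating characteristic curve plus a p-value, and that a non-significant difference counts as a tie. The function name and input layout are illustrative.

```python
from typing import Dict, Iterable, Tuple


def tally_head_to_head(
    results: Iterable[Tuple[float, float, float]], alpha: float = 0.05
) -> Dict[str, float]:
    """Summarise head-to-head scenarios as win/tie/loss percentages.

    Each item is (mean_auc_a, mean_auc_b, p_value) for one scenario.
    Assumed rule (not stated in the abstract): p >= alpha is a tie;
    otherwise the method with the higher mean score wins.
    """
    wins = ties = losses = 0
    for mean_a, mean_b, p_value in results:
        if p_value >= alpha:
            ties += 1
        elif mean_a > mean_b:
            wins += 1
        else:
            losses += 1
    total = wins + ties + losses
    return {
        "win": 100.0 * wins / total,
        "tie": 100.0 * ties / total,
        "loss": 100.0 * losses / total,
    }
```

Under this rule, 12 wins, 9 ties, and 3 losses out of 24 scenarios give 50%, 37.5%, and 12.5%, which matches the split reported above.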
The experimental results indicated that uFilter statistically outperformed the U-test method and showed performance similar, but not superior, to that of traditional feature selection methods (CHI2 discretization, IG, 1Rule, and Relief). uFilter showed competitive and cost-effective results in selecting relevant features as a support tool for breast cancer CADx methods, especially with unbalanced datasets. Finally, redundancy analysis as a complementary step to uFilter provided an effective way to find optimal subsets of features without decreasing classification performance.