Sun Lei, Wang Jun, Wei Jinmao
Institute of Big Data, College of Computer and Control Engineering, Nankai University, 38 Tongyan Road, Tianjin, 300350, China.
BMC Bioinformatics. 2017 Mar 14;18(Suppl 3):50. doi: 10.1186/s12859-017-1468-4.
The Receiver Operating Characteristic (ROC) curve is widely used to evaluate classification performance in the biomedical field. Because it handles imbalanced and cost-sensitive data well, the ROC curve has become a popular metric for evaluating and identifying disease-related genes (features). Existing ROC-based feature selection approaches are simple and effective at evaluating individual features. However, these approaches may fail to find the real target feature subset because they lack an effective means of reducing redundancy between features, a step that is essential in machine learning.
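The individual-feature evaluation described above can be illustrated with a minimal sketch. It assumes binary labels (1 = positive, 0 = negative) and scores each feature by the area under its ROC curve, computed from pairwise comparisons. The function names `single_feature_auc` and `rank_features_by_auc` are hypothetical and are not taken from the paper.

```python
import numpy as np

def single_feature_auc(x, y):
    """AUC obtained when feature x alone is used as a score for labels y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y)
    pos = x[y == 1]
    neg = x[y == 0]
    # Fraction of (positive, negative) pairs ranked correctly; ties count half.
    diff = pos[:, None] - neg[None, :]
    auc = (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / (len(pos) * len(neg))
    # Assumption: a feature that is lower in positives is equally useful,
    # so the AUC is folded around 0.5.
    return max(auc, 1.0 - auc)

def rank_features_by_auc(X, y):
    """Return feature indices sorted by decreasing AUC, plus the AUC values."""
    X = np.asarray(X, dtype=float)
    aucs = np.array([single_feature_auc(X[:, j], y) for j in range(X.shape[1])])
    order = np.argsort(-aucs)
    return order, aucs
```

A ranking of this kind scores each feature in isolation, so two highly ranked features may carry the same information, which is the redundancy problem noted above.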
In this paper, we propose to assess feature complementarity by measuring the distances between misclassified instances and their nearest misses along the dimensions of pairwise features. If a misclassified instance and its nearest miss on one feature dimension are far apart on another feature dimension, the two features are regarded as complementary to each other. Building on this measure, we propose a novel filter feature selection approach based on ROC analysis. The new approach employs an efficient heuristic search strategy to select the features with the highest complementarity. Experimental results on a broad range of microarray data sets show that classifiers built on the feature subset selected by our approach achieve the minimal balanced error rate with a small number of significant features.
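The complementarity idea can be sketched as follows. This is an illustration under stated assumptions, not the paper's exact formulation: features are assumed to be scaled to a common range, a "nearest miss" is taken to be the closest instance of the opposite class, and misclassification on a single feature is decided by a hypothetical threshold at the midpoint of the two class means. The names `misclassified_on_feature` and `complementarity` are likewise hypothetical.

```python
import numpy as np

def misclassified_on_feature(x, y):
    """Indices misclassified by a midpoint-of-class-means threshold (assumed rule)."""
    mu0, mu1 = x[y == 0].mean(), x[y == 1].mean()
    thr = (mu0 + mu1) / 2.0
    # Predict the positive class on whichever side of the threshold its mean lies.
    pred = (x > thr).astype(int) if mu1 >= mu0 else (x < thr).astype(int)
    return np.flatnonzero(pred != y)

def complementarity(X, y, i, j):
    """Mean distance on feature j between instances misclassified on feature i
    and their nearest misses on feature i. Larger means j helps i more."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    xi, xj = X[:, i], X[:, j]
    wrong = misclassified_on_feature(xi, y)
    if wrong.size == 0:
        return 0.0  # nothing for feature j to correct
    dists = []
    for k in wrong:
        other = np.flatnonzero(y != y[k])  # instances of the opposite class
        nearest_miss = other[np.argmin(np.abs(xi[other] - xi[k]))]
        dists.append(abs(xj[k] - xj[nearest_miss]))
    return float(np.mean(dists))
```

Under these assumptions, a feature that is a copy of feature `i` yields a small score, because the nearest miss stays close on it, whereas a feature that separates the confused pair yields a large one.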
Compared with other ROC-based feature selection approaches, our new approach can select fewer features and effectively improve the classification performance.