Golugula Abhishek, Lee George, Madabhushi Anant
Department of Electrical and Computer Engineering, Rutgers University, Piscataway, New Jersey 08854, USA.
Annu Int Conf IEEE Eng Med Biol Soc. 2011;2011:949-52. doi: 10.1109/IEMBS.2011.6090214.
In this work, we analyze and evaluate different strategies for comparing Feature Selection (FS) schemes on High Dimensional (HD) biomedical datasets (e.g., gene and protein expression studies) with a small sample size (SSS). Additionally, we define a new measure, Robustness, specifically for comparing the ability of an FS scheme to remain invariant to changes in its training data. While classifier accuracy has been the de facto method for evaluating FS schemes, the curse of dimensionality means it might not always be the appropriate measure for HD/SSS datasets. A small sample size makes it more likely that the dataset contains samples that are not representative of the true distribution of the whole population. An ideal FS scheme should nevertheless be robust enough to produce the same results each time the training data change. In this study, we employed the robustness measure in conjunction with classifier accuracy (measured via the K-Nearest Neighbor and Random Forest classifiers) to quantitatively compare five different FS schemes (T-test, F-test, Kolmogorov-Smirnov Test, Wilks' Lambda Test, and Wilcoxon Rank Sum Test) on five HD/SSS gene and protein expression datasets corresponding to ovarian cancer, lung cancer, bone lesions, celiac disease, and coronary heart disease. Of the five FS schemes compared, the Wilcoxon Rank Sum Test was found to outperform the others in terms of both classification accuracy and robustness. Our results suggest that both classifier accuracy and robustness should be considered when deciding on the appropriate FS scheme for HD/SSS datasets.
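The idea of scoring an FS scheme by how little its output changes when the training data change can be illustrated with a small sketch. The abstract does not give the formula for the Robustness measure, so the code below substitutes an assumed stand-in: the mean pairwise Jaccard overlap between the top-k feature sets selected on repeated stratified subsamples. The ranking uses an absolute Welch t-statistic as one example of a filter-style scheme. The toy data are synthetic, and all function names and parameters (`t_scores`, `select_top_k`, `selection_stability`, `make_toy_data`, `frac`, `n_resamples`) are hypothetical, not taken from the paper.

```python
import random
import statistics
from itertools import combinations


def t_scores(X, y):
    """Absolute Welch t-statistic per feature for a two-class problem (labels 0/1)."""
    scores = []
    for j in range(len(X[0])):
        a = [row[j] for row, label in zip(X, y) if label == 0]
        b = [row[j] for row, label in zip(X, y) if label == 1]
        se = (statistics.variance(a) / len(a) + statistics.variance(b) / len(b)) ** 0.5
        diff = abs(statistics.mean(a) - statistics.mean(b))
        scores.append(diff / se if se > 0 else 0.0)
    return scores


def select_top_k(X, y, k):
    """Return the indices of the k highest-scoring features as a set."""
    scores = t_scores(X, y)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return set(ranked[:k])


def selection_stability(X, y, k, n_resamples=20, frac=0.8, seed=0):
    """Assumed stand-in for robustness: mean pairwise Jaccard overlap of the
    feature sets selected on stratified subsamples of the training data.
    Requires at least two samples per class."""
    rng = random.Random(seed)
    idx0 = [i for i, label in enumerate(y) if label == 0]
    idx1 = [i for i, label in enumerate(y) if label == 1]
    subsets = []
    for _ in range(n_resamples):
        # Subsample each class without replacement so both classes stay present.
        keep = (rng.sample(idx0, max(2, int(frac * len(idx0))))
                + rng.sample(idx1, max(2, int(frac * len(idx1)))))
        subsets.append(select_top_k([X[i] for i in keep], [y[i] for i in keep], k))
    overlaps = [len(a & b) / len(a | b) for a, b in combinations(subsets, 2)]
    return statistics.mean(overlaps)


def make_toy_data(n_per_class=15, n_features=200, n_informative=5, shift=2.0, seed=1):
    """Synthetic HD/SSS-like data: few samples, many features, a handful informative."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
            for j in range(n_informative):
                row[j] += shift * label  # class 1 is shifted on informative features
            X.append(row)
            y.append(label)
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    print("stability of top-5 selection:", selection_stability(X, y, k=5))
```

A score near 1 means the scheme picks nearly the same features on every subsample, and a score near 0 means the selection is dominated by which samples happened to be included. To compare several schemes in this way, one would swap `t_scores` for another ranking statistic and report the stability value alongside classifier accuracy.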