Binghamton University, Binghamton.
IEEE/ACM Trans Comput Biol Bioinform. 2012 Jan-Feb;9(1):262-72. doi: 10.1109/TCBB.2011.47. Epub 2011 Mar 3.
Feature selection from gene expression microarray data is a widely used technique for selecting candidate genes in various cancer studies. Besides the predictive ability of the selected genes, an important aspect in evaluating a selection method is the stability of the selected genes. Experts instinctively have high confidence in the result of a selection method that selects similar sets of genes under some variations to the samples. However, a common problem of existing feature selection methods for gene expression data is that the genes selected by the same method often vary significantly with sample variations. In this work, we propose a general framework of sample weighting to improve the stability of feature selection methods under sample variations. The framework first weights each sample in a given training set according to its influence on the estimation of feature relevance, and then provides the weighted training set to a feature selection method. We also develop an efficient margin-based sample weighting algorithm under this framework. Experiments on a set of microarray data sets show that the proposed algorithm significantly improves the stability of representative feature selection algorithms such as SVM-RFE and ReliefF, without sacrificing their classification performance. Moreover, the proposed algorithm also leads to more stable gene signatures than the state-of-the-art ensemble method, particularly for small signature sizes.
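The two-step framework described above (weight each training sample, then pass the weighted set to a feature selector) can be illustrated with a minimal NumPy sketch. This is not the paper's algorithm: the abstract gives no formulas, so everything here is an assumption made for illustration. The function names (margin_vectors, sample_weights, weighted_feature_scores, select_features) are hypothetical, the per-feature margin is taken as the distance to the nearest sample of another class minus the distance to the nearest sample of the same class, the exponential down-weighting of samples whose margin vectors lie far from the others is an arbitrary choice, and a simple weighted two-class mean-difference score stands in for SVM-RFE or ReliefF. The sketch also assumes exactly two classes with at least two samples each.

```python
import numpy as np


def margin_vectors(X, y):
    """Per-feature margin of each sample (assumed definition):
    |x - nearest miss| - |x - nearest hit|, neighbours found by L1 distance."""
    n = X.shape[0]
    dist = np.abs(X[:, None, :] - X[None, :, :]).sum(axis=2)
    np.fill_diagonal(dist, np.inf)  # a sample is never its own neighbour
    margins = np.empty(X.shape, dtype=float)
    for i in range(n):
        same = y == y[i]
        hit = np.argmin(np.where(same, dist[i], np.inf))    # nearest same-class sample
        miss = np.argmin(np.where(~same, dist[i], np.inf))  # nearest other-class sample
        margins[i] = np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])
    return margins


def sample_weights(X, y):
    """Down-weight samples whose margin vectors are far from the rest.
    The exponential decay is an illustrative assumption."""
    m = margin_vectors(X, y)
    d = np.abs(m[:, None, :] - m[None, :, :]).sum(axis=2)
    avg = d.sum(axis=1) / (len(X) - 1)  # mean distance to the other samples
    scale = avg.mean() or 1.0           # guard against division by zero
    w = np.exp(-avg / scale)
    return w / w.sum()                  # normalise so the weights sum to 1


def weighted_feature_scores(X, y, w):
    """Stand-in selector: weighted difference of class means over the
    pooled weighted standard deviation, computed per feature."""
    classes = np.unique(y)
    a, b = y == classes[0], y == classes[1]
    mu_a = np.average(X[a], axis=0, weights=w[a])
    mu_b = np.average(X[b], axis=0, weights=w[b])
    var_a = np.average((X[a] - mu_a) ** 2, axis=0, weights=w[a])
    var_b = np.average((X[b] - mu_b) ** 2, axis=0, weights=w[b])
    return np.abs(mu_a - mu_b) / np.sqrt(var_a + var_b + 1e-12)


def select_features(X, y, k):
    """Step 1: weight the samples. Step 2: rank features on the weighted set."""
    w = sample_weights(X, y)
    scores = weighted_feature_scores(X, y, w)
    top_k = np.argsort(scores)[::-1][:k]  # indices of the k highest scores
    return top_k, w
```

Calling select_features(X, y, k) on an expression matrix X (samples by genes) with binary labels y returns the indices of the k top-ranked genes together with the sample weights. Stability could then be assessed by repeating the call on resampled training sets and comparing the selected index sets, which is the kind of sample variation the abstract refers to.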