Lee Yuh-Jye, Chang Chien-Chung, Chao Chia-Huang
Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology, Taipei, Taiwan.
J Biopharm Stat. 2008;18(5):827-40. doi: 10.1080/10543400802277868.
In this study, the authors propose a new feature selection scheme, incremental forward feature selection, which is inspired by incremental reduced support vector machines. In their method, a candidate feature is added to the currently selected feature subset if it brings in the most additional information, measured as the distance between the candidate feature vector and the column space spanned by the current feature subset. The scheme can therefore exclude highly linearly correlated features, which provide redundant information and might degrade the efficiency of learning algorithms. The method is compared with the weight score approach and the 1-norm support vector machine on two well-known microarray gene expression data sets, the acute leukemia and colon cancer data sets. Both data sets have very few observations but a huge number of genes. The linear smooth support vector machine was applied to the feature subsets selected by each of the three schemes, and the subsets selected by the 1-norm support vector machine and by incremental forward feature selection gave slightly better classification results. Finally, the authors claim that the remaining genes still contain useful information: the previously selected features are iteratively removed from the data sets, and the feature selection and classification steps are repeated for four rounds. The results show that many distinct feature subsets can provide enough information for the classification tasks in these two microarray gene expression data sets.
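The selection criterion described in the abstract (add the feature whose vector lies farthest from the column space of the features already chosen) can be illustrated with a short NumPy sketch. This is not the authors' implementation: the function name `incremental_forward_selection`, the column normalization, the tolerance `tol`, and the tie-breaking rule for the first feature are all assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np

def incremental_forward_selection(X, k, tol=1e-8):
    """Greedy sketch: pick up to k columns of X, each time taking the column
    farthest from the span of the columns selected so far.

    Hypothetical helper; normalization and stopping rule are assumptions.
    """
    X = np.asarray(X, dtype=float)
    n_features = X.shape[1]

    # Scale columns to unit length so distances are comparable (assumption).
    norms = np.linalg.norm(X, axis=0)
    norms[norms == 0] = 1.0
    R = X / norms  # R holds each column's residual w.r.t. the current span

    chosen = np.zeros(n_features, dtype=bool)
    selected = []
    for _ in range(min(k, n_features)):
        # Residual norm = distance from each column to the current column space.
        dist = np.linalg.norm(R, axis=0)
        dist[chosen] = 0.0
        j = int(np.argmax(dist))
        if dist[j] <= tol:
            break  # every remaining feature is (numerically) redundant
        chosen[j] = True
        selected.append(j)
        # Remove the new direction from all residuals (Gram-Schmidt step).
        q = R[:, j] / dist[j]
        R = R - np.outer(q, q @ R)
    return selected

# Column 1 is a multiple of column 0, so it adds no new information.
X_demo = np.array([[1.0, 2.0, 0.0],
                   [0.0, 0.0, 1.0],
                   [0.0, 0.0, 0.0]])
demo_selected = incremental_forward_selection(X_demo, k=3)
```

In the small demo, the second column is exactly collinear with the first, so its distance to the selected subspace is zero and it is never chosen. This mirrors the abstract's point that highly linearly correlated features are excluded.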