Laboratory of DNA Information Analysis, Human Genome Center, Institute of Medical Science, University of Tokyo, 4-6-1 Shirokanedai, Minato-ku, Tokyo 108-8639, Japan.
IEEE/ACM Trans Comput Biol Bioinform. 2012 May-Jun;9(3):754-64. doi: 10.1109/TCBB.2011.151.
Most conventional feature selection algorithms share a drawback: a weakly ranked gene that could classify well when combined with an appropriate subset of genes is left out of the selection. To address this shortcoming, we propose a feature selection algorithm for sample classification in gene expression data analysis. The proposed algorithm first divides the genes into relatively small subsets (each of size roughly h), then selects an informative smaller subset of genes (of size r < h) from one subset and merges the chosen genes with another gene subset (of size r) to update the gene subset. This process is repeated until all subsets are merged into one informative subset. We illustrate the effectiveness of the proposed algorithm by analyzing three distinct gene expression data sets. The method shows promising classification accuracy on all the test data sets. We also show the relevance of the selected genes in terms of their biological functions.
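The divide-and-merge procedure described in the abstract can be sketched as follows. This is only one possible reading of the abstract, not the authors' implementation: the abstract does not say how the informative r genes are chosen within a subset, so the sketch substitutes a plain greedy forward selection driven by a caller-supplied `score` function. The names `greedy_select`, `divide_and_merge`, and `score` are hypothetical, and genes are represented simply as integer indices.

```python
from typing import Callable, List, Sequence

# Hypothetical scoring interface: maps a list of gene indices to a quality
# value (for example, cross-validated accuracy). Not taken from the paper.
Score = Callable[[Sequence[int]], float]


def greedy_select(pool: Sequence[int], r: int, score: Score) -> List[int]:
    """Pick up to r genes from pool by greedy forward selection.

    Assumed stand-in for the within-subset selection step, which the
    abstract does not specify.
    """
    chosen: List[int] = []
    remaining = list(pool)
    while remaining and len(chosen) < r:
        # Add the gene that gives the best score together with those chosen.
        best = max(remaining, key=lambda g: score(chosen + [g]))
        chosen.append(best)
        remaining.remove(best)
    return chosen


def divide_and_merge(genes: Sequence[int], h: int, r: int, score: Score) -> List[int]:
    """Divide genes into blocks of size about h, reduce each block to r genes,
    then repeatedly merge two reduced subsets and reselect r genes."""
    if not 0 < r < h:
        raise ValueError("need 0 < r < h")
    # Divide: consecutive blocks of at most h genes.
    blocks = [list(genes[i:i + h]) for i in range(0, len(genes), h)]
    if not blocks:
        return []
    # Reduce every block to an informative subset of size r.
    reduced = [greedy_select(block, r, score) for block in blocks]
    # Merge: pool the current subset with the next one (about 2r genes)
    # and keep the r genes that score best together.
    current = reduced[0]
    for nxt in reduced[1:]:
        current = greedy_select(current + nxt, r, score)
    return current
```

Because each selection step only searches a pool of at most h (or 2r) genes, a weakly ranked gene is scored together with candidate partners inside a small pool rather than being discarded by a global ranking. In practice `score` might wrap a classifier's cross-validated accuracy on the chosen columns of the expression matrix; that choice is left to the caller here.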