Baralis Elena, Bruno Giulia, Fiori Alessandro
Politecnico di Torino, Italy.
Annu Int Conf IEEE Eng Med Biol Soc. 2008;2008:5692-5. doi: 10.1109/IEMBS.2008.4650506.
A fundamental problem in microarray analysis is identifying relevant genes from large amounts of expression data. Feature selection aims to identify a subset of features for building robust learning models. However, finding the optimal number of features is challenging, because it is a trade-off between information loss when pruning is excessive and increased noise when pruning is too weak. This paper presents a novel representation of genes as strings of bits and a method that automatically selects the minimum number of genes needed to reach good classification accuracy on the training set. The method first eliminates redundant features, which add no further information for classification, and then applies a set covering algorithm. Preliminary experimental results on public datasets support the intuition behind the proposed method, which achieves high classification accuracy.
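The two-stage pipeline outlined in the abstract (redundancy elimination followed by set covering over bit-string genes) can be illustrated with a minimal sketch. The abstract does not specify how the bit strings are built or which set covering algorithm is used, so everything here is an assumption: each gene is taken to be an integer bitmask in which bit i is set when the gene alone separates training sample i correctly, a gene is treated as redundant when its mask is contained in another gene's mask, and the covering step is the standard greedy heuristic, which approximates but does not guarantee a minimum cover. The names `remove_redundant`, `greedy_set_cover`, and the gene labels are hypothetical.

```python
from typing import Dict, List


def popcount(x: int) -> int:
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def remove_redundant(masks: Dict[str, int]) -> Dict[str, int]:
    """Drop genes whose covered samples are a subset of a kept gene's samples.

    Assumption: "redundant" means the gene's bitmask is contained in another
    gene's bitmask, so it covers no sample the other gene misses.
    """
    kept: Dict[str, int] = {}
    # Visit larger masks first so that a subset is always seen after its superset.
    ordered = sorted(masks.items(), key=lambda kv: (-popcount(kv[1]), kv[0]))
    for gene, mask in ordered:
        if mask == 0:
            continue  # covers no training sample at all
        if any(mask & other == mask for other in kept.values()):
            continue  # contained in an already kept gene
        kept[gene] = mask
    return kept


def greedy_set_cover(masks: Dict[str, int], n_samples: int) -> List[str]:
    """Greedy heuristic: repeatedly pick the gene covering most uncovered samples."""
    uncovered = (1 << n_samples) - 1
    selected: List[str] = []
    while uncovered:
        # Sorting the names makes tie-breaking deterministic.
        best = max(sorted(masks), key=lambda g: popcount(masks[g] & uncovered))
        gain = masks[best] & uncovered
        if gain == 0:
            break  # remaining samples cannot be covered by any gene
        selected.append(best)
        uncovered &= ~gain
    return selected


if __name__ == "__main__":
    # Toy data: 5 training samples, bit i corresponds to sample i.
    toy = {
        "g1": 0b00111,  # samples 0, 1, 2
        "g2": 0b00011,  # samples 0, 1 (contained in g1)
        "g3": 0b11000,  # samples 3, 4
        "g4": 0b01100,  # samples 2, 3
    }
    pruned = remove_redundant(toy)
    print(sorted(pruned))               # ['g1', 'g3', 'g4']
    print(greedy_set_cover(pruned, 5))  # ['g1', 'g3']
```

In this toy run, g2 is dropped because g1 already covers all of its samples, and two genes are enough to cover all five training samples. An exact minimum cover would require an integer programming or exhaustive formulation rather than the greedy step assumed here.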