Inza Iñaki, Larrañaga Pedro, Blanco Rosa, Cerrolaza Antonio J
Department of Computer Science and Artificial Intelligence, University of the Basque Country, P.O. Box 649, E-20080 Donostia-San Sebastián, Basque Country, Spain.
Artif Intell Med. 2004 Jun;31(2):91-103. doi: 10.1016/j.artmed.2004.01.007.
DNA microarray experiments, which generate thousands of gene expression measurements, are used to collect information from tissue and cell samples about gene expression differences that could be useful for disease diagnosis, distinguishing specific tumor types, etc. One important application of gene expression microarray data is the classification of samples into known categories. Because DNA microarray technology measures gene expression en masse, the resulting data have far more features (genes) than samples. Because the predictive accuracy of supervised classifiers that try to discriminate between the classes of the problem decays in the presence of irrelevant and redundant features, a dimensionality reduction process is essential. We propose the application of a gene selection process, which also enables the biology researcher to focus on promising gene candidates that actively contribute to classification in these large-scale microarrays. Two basic approaches to feature selection appear in the machine learning and pattern recognition literature: the filter and wrapper techniques. Filter procedures are used in most work in the area of DNA microarrays. In this work, a group of different filter metrics is compared with a wrapper sequential search procedure. The comparison is performed on two well-known DNA microarray datasets using four classic supervised classifiers. The study is carried out on both the original continuous gene expression data and the data discretized into three intervals. Two well-known filter metrics are used for continuous data, and four classic filter measures are used for discretized data. The same wrapper approach is used for both continuous and discretized data.
The application of filter and wrapper gene selection procedures yields considerably better accuracy than the approach without gene selection, together with notable dimensionality reductions. Although the wrapper approach is generally more accurate than the filter metrics, this improvement comes at considerable computational cost. We note that most of the genes selected by the proposed filter and wrapper procedures in discrete and continuous microarray data appear in the lists of relevant, informative genes detected by previous studies of these datasets. The aim of this work is to contribute to the field of gene selection in DNA microarray datasets. Through an extensive comparison with the more popular filter techniques, we hope to contribute to the wider adoption and study of the wrapper approach in this type of domain.
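The difference between the two families of gene selection described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the authors' method: the abstract does not name the filter metrics, the classifiers, or the search direction, so the sketch uses a signal-to-noise style score as the filter, sequential forward search as the wrapper, leave-one-out accuracy of a 1-nearest-neighbour classifier as the wrapper's evaluation function, and a tiny made-up dataset in place of real microarray data. The function names `filter_rank`, `loo_accuracy`, and `wrapper_forward` are hypothetical.

```python
from statistics import mean, stdev


def filter_rank(X, y):
    """Filter: score each feature on its own, with no classifier involved.

    Returns feature indices sorted from most to least discriminative,
    using |mean_0 - mean_1| / (sd_0 + sd_1) as an illustrative score.
    """
    scores = []
    for j in range(len(X[0])):
        a = [row[j] for row, label in zip(X, y) if label == 0]
        b = [row[j] for row, label in zip(X, y) if label == 1]
        spread = (stdev(a) + stdev(b)) or 1e-12  # guard against zero spread
        scores.append(abs(mean(a) - mean(b)) / spread)
    return sorted(range(len(scores)), key=lambda j: -scores[j])


def loo_accuracy(X, y, features):
    """Leave-one-out accuracy of 1-nearest-neighbour restricted to `features`."""
    hits = 0
    for i, row in enumerate(X):
        best_dist, best_label = None, None
        for k, other in enumerate(X):
            if k == i:
                continue  # hold out the sample being classified
            d = sum((row[j] - other[j]) ** 2 for j in features)
            if best_dist is None or d < best_dist:
                best_dist, best_label = d, y[k]
        hits += best_label == y[i]
    return hits / len(X)


def wrapper_forward(X, y, max_features=3):
    """Wrapper: grow the gene subset greedily, guided by classifier accuracy.

    Each step re-evaluates the classifier once per remaining feature,
    which is why wrappers cost far more computation than filters.
    """
    selected, best = [], 0.0
    remaining = list(range(len(X[0])))
    while remaining and len(selected) < max_features:
        candidate = max(remaining, key=lambda j: loo_accuracy(X, y, selected + [j]))
        acc = loo_accuracy(X, y, selected + [candidate])
        if acc <= best:
            break  # stop when no single addition improves accuracy
        selected.append(candidate)
        remaining.remove(candidate)
        best = acc
    return selected, best


# Made-up toy data: 6 samples, 3 "genes"; only gene 0 separates the classes.
X = [
    [1.0, 5.0, 0.3],
    [1.2, 1.0, 0.9],
    [0.9, 3.0, 0.1],
    [3.0, 4.8, 0.8],
    [3.2, 1.2, 0.2],
    [2.9, 3.1, 0.5],
]
y = [0, 0, 0, 1, 1, 1]

ranking = filter_rank(X, y)
selected, accuracy = wrapper_forward(X, y)
print("filter ranking:", ranking)
print("wrapper subset:", selected, "LOO accuracy:", accuracy)
```

In this sketch the filter evaluates each gene once, whereas the wrapper calls the classifier for every candidate gene at every step. That difference is the source of the extra computational cost the abstract attributes to the wrapper approach.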