LS Bound based gene selection for DNA microarray data.

Author information

Zhou Xin, Mao K Z

Affiliation

School of Electrical and Electronic Engineering, Nanyang Technological University, Nanyang Avenue, Singapore 639798.

Publication information

Bioinformatics. 2005 Apr 15;21(8):1559-64. doi: 10.1093/bioinformatics/bti216. Epub 2004 Dec 14.

DOI:10.1093/bioinformatics/bti216

PMID:15598834

Abstract

MOTIVATION

One problem with discriminant analysis of DNA microarray data is that each sample is represented by a very large number of genes, many of which are irrelevant, insignificant, or redundant to the discriminant problem at hand. Methods for selecting important genes are therefore of great significance in microarray data analysis. In the present study, a new criterion, called the LS Bound measure, is proposed to address the gene selection problem. The LS Bound measure is derived from the leave-one-out procedure of LS-SVMs (least squares support vector machines). As an upper bound on the leave-one-out classification error, it reflects to some extent the generalization performance of gene subsets.
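The abstract does not state the LS Bound formula, so the sketch below does not attempt to reproduce it. It illustrates only the kind of quantity such a measure is concerned with: for an LS-SVM, leave-one-out outputs can be read off a single trained model without retraining. The sketch assumes the regression-on-labels form of the LS-SVM and uses the standard identity y_i - f^(-i)(x_i) = alpha_i / (C^-1)_ii, where C is the LS-SVM system matrix. The linear kernel, the value of `gamma`, the toy data, and all function names are assumptions made for illustration. A brute-force version is included so the identity can be checked.

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        piv = M[c][c]
        M[c] = [v / piv for v in M[c]]
        for r in range(n):
            if r != c:
                f = M[r][c]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    return [M[i][n] for i in range(n)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def system_matrix(X, gamma):
    """LS-SVM system matrix [[K + I/gamma, 1], [1^T, 0]] (linear kernel assumed)."""
    n = len(X)
    C = [[dot(X[i], X[j]) + (1.0 / gamma if i == j else 0.0) for j in range(n)] + [1.0]
         for i in range(n)]
    C.append([1.0] * n + [0.0])
    return C


def train(X, y, gamma):
    """Return (alpha, b) solving C [alpha; b] = [y; 0]."""
    sol = solve(system_matrix(X, gamma), list(y) + [0.0])
    return sol[:-1], sol[-1]


def loo_margins_closed_form(X, y, gamma):
    """y_i * f^(-i)(x_i) for every sample, from one trained model (labels are +/-1)."""
    n = len(X)
    C = system_matrix(X, gamma)
    alpha, _ = train(X, y, gamma)
    margins = []
    for i in range(n):
        unit = [1.0 if j == i else 0.0 for j in range(n + 1)]
        inv_ii = solve(C, unit)[i]  # i-th diagonal entry of C^-1
        margins.append(1.0 - y[i] * alpha[i] / inv_ii)
    return margins


def loo_margins_brute_force(X, y, gamma):
    """Same margins, obtained by actually retraining with each sample held out."""
    margins = []
    for i in range(len(X)):
        Xr, yr = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        alpha, b = train(Xr, yr, gamma)
        f = sum(a * dot(xj, X[i]) for a, xj in zip(alpha, Xr)) + b
        margins.append(y[i] * f)
    return margins


# Toy data (hypothetical): 7 samples, 2 "genes", labels in {+1, -1}.
X = [[2.0, 1.0], [1.5, 2.0], [3.0, 2.5], [-1.0, -0.5],
     [-2.0, -1.5], [-1.5, -2.5], [0.5, -0.2]]
y = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0, -1.0]
margins = loo_margins_closed_form(X, y, gamma=10.0)
loo_errors = sum(m <= 0 for m in margins)  # samples misclassified when held out
```

A gene subset could then be scored by evaluating such a leave-one-out quantity on the columns of `X` restricted to that subset, with lower values preferred. Whether and how the LS Bound measure relaxes this exact count into a bound is specified in the paper, not here.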

RESULTS

We applied the LS Bound measure to gene selection on two benchmark microarray datasets: colon cancer and leukemia. We also compared the LS Bound measure with other evaluation criteria, including the well-known Fisher's ratio and the Mahalanobis class separability measure, and with other published gene selection algorithms, including Weighting factor and SVM Recursive Feature Elimination. The strength of the LS Bound measure is that it provides gene subsets leading to more accurate classification results than the filter method, while its computational complexity remains at the level of the filter method.
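For context on the filter-style baselines mentioned above, here is a minimal sketch of ranking genes by Fisher's ratio. It assumes one common two-class form, (mean_pos - mean_neg)^2 / (var_pos + var_neg), computed per gene; the exact variant used in the paper is not given in the abstract. The function names and the toy expression matrix are hypothetical.

```python
from statistics import mean, variance


def fisher_ratio(values, labels):
    """Two-class Fisher's ratio for one gene (assumed form; labels are +1 / -1)."""
    pos = [v for v, l in zip(values, labels) if l == 1]
    neg = [v for v, l in zip(values, labels) if l == -1]
    denom = variance(pos) + variance(neg)  # sample variances, need >= 2 samples per class
    if denom == 0:
        return float("inf")
    return (mean(pos) - mean(neg)) ** 2 / denom


def rank_genes(X, labels, k):
    """Return indices of the k genes with the largest Fisher's ratio.

    X is a list of samples, each a list of expression values (one per gene).
    """
    n_genes = len(X[0])
    scores = [fisher_ratio([row[g] for row in X], labels) for g in range(n_genes)]
    return sorted(range(n_genes), key=lambda g: scores[g], reverse=True)[:k]


# Toy data (hypothetical): 4 samples, 3 genes; gene 1 separates the classes best.
X = [[1.0, 5.0, 0.2],
     [1.2, 5.1, 0.9],
     [0.9, 1.0, 0.3],
     [1.1, 0.9, 0.8]]
labels = [1, 1, -1, -1]
top = rank_genes(X, labels, k=2)
```

Because each gene is scored independently, a filter like this costs one pass over the data per gene but ignores redundancy between genes. That is the trade-off the abstract's complexity comparison refers to.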

AVAILABILITY

A companion website can be accessed at http://www.ntu.edu.sg/home5/pg02776030/lsbound/. The website contains: (1) the source code of the gene selection algorithm; (2) the complete set of tables and figures regarding the experimental study; (3) proof of the inequality (9).

CONTACT

ekzmao@ntu.edu.sg.
