Tang E Ke, Suganthan P N, Yao Xin
School of Electrical & Electronic Engineering, Nanyang Technological University, Singapore.
BMC Bioinformatics. 2006 Feb 27;7:95. doi: 10.1186/1471-2105-7-95.
In discriminant analysis of microarray data, a small number of samples is typically described by a large number of genes. Conducting the discriminant analysis with all the genes is not only difficult but also unnecessary. Hence, gene selection is usually performed to identify the important genes.
A gene selection method searches for an optimal or near-optimal subset of genes with respect to a given evaluation criterion. In this paper, we propose a new evaluation criterion, named the leave-one-out calculation (LOOC) measure. A gene selection method, the leave-one-out calculation sequential forward selection (LOOCSFS) algorithm, is then presented by combining the LOOC measure with the sequential forward selection scheme. Further, a novel gene selection algorithm, the gradient-based leave-one-out gene selection (GLGS) algorithm, is also proposed. Both gene selection algorithms originate from an efficient and exact calculation of the leave-one-out cross-validation error of the least squares support vector machine (LS-SVM). The proposed approaches are applied to two microarray datasets and compared with other well-known gene selection methods, using code available from the second author.
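To make the idea of an exact leave-one-out calculation concrete, the following is a minimal sketch, not the authors' code. It assumes the standard LS-SVM linear system with a linear kernel and uses the well-known closed-form identity that the leave-one-out residual of sample i equals alpha_i divided by the corresponding diagonal entry of the inverse system matrix. The function names, the regularization parameter `gamma`, and the use of the summed squared leave-one-out residuals as the selection criterion are all assumptions made for illustration; the paper's LOOC measure may be defined differently.

```python
import numpy as np

def lssvm_system(K, y, gamma):
    """Build the LS-SVM system matrix C and solve C [b, alpha] = [0, y]."""
    n = len(y)
    C = np.zeros((n + 1, n + 1))
    C[0, 1:] = 1.0
    C[1:, 0] = 1.0
    C[1:, 1:] = K + np.eye(n) / gamma   # kernel matrix plus ridge term
    sol = np.linalg.solve(C, np.concatenate(([0.0], y)))
    return sol[0], sol[1:], C           # bias b, coefficients alpha, matrix C

def loo_residuals(K, y, gamma):
    """Leave-one-out residuals y_i - f^(-i)(x_i) from a single fit.

    Uses the identity r_i = alpha_i / (C^{-1})_{ii}, so no retraining
    is needed for the n held-out samples.
    """
    _, alpha, C = lssvm_system(K, y, gamma)
    diag = np.diag(np.linalg.inv(C))[1:]  # skip the entry belonging to b
    return alpha / diag

def forward_select(X, y, n_genes, gamma=1.0):
    """Greedy sequential forward selection (illustrative criterion).

    At each step, add the gene whose inclusion gives the smallest sum of
    squared leave-one-out residuals. This criterion is an assumption,
    not necessarily the LOOC measure of the paper.
    """
    selected = []
    remaining = list(range(X.shape[1]))
    for _ in range(n_genes):
        def score(g):
            Xs = X[:, selected + [g]]
            r = loo_residuals(Xs @ Xs.T, y, gamma)  # linear kernel
            return float(np.sum(r ** 2))
        best = min(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

Here `X` is a samples-by-genes expression matrix and `y` holds class labels coded as +1 and -1. Each candidate subset costs one matrix inversion of size (n + 1), where n is the number of samples, rather than n separate retrainings, which is what makes a leave-one-out criterion affordable inside a search over genes.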
The proposed gene selection approaches can provide gene subsets that lead to more accurate classification results, while their computational complexity is comparable to that of existing methods. The GLGS algorithm also scales better to datasets with a very large number of genes.