Pochet Nathalie, De Smet Frank, Suykens Johan A K, De Moor Bart L R
ESAT-SCD (SISTA), K.U. Leuven, Kasteelpark Arenberg 10, 3001 Leuven-Heverlee, Belgium.
Bioinformatics. 2004 Nov 22;20(17):3185-95. doi: 10.1093/bioinformatics/bth383. Epub 2004 Jul 1.
Microarrays can measure the expression levels of thousands of genes simultaneously. Combined with classification methods, this technology can support clinical management decisions for individual patients, e.g. in oncology. This paper systematically benchmarks non-linear against linear techniques and examines the role of dimensionality reduction methods in this setting.
A systematic benchmarking study is performed by comparing linear versions of standard classification and dimensionality reduction techniques with their non-linear counterparts, which use a radial basis function (RBF) kernel. Nine binary cancer classification problems, derived from seven publicly available microarray datasets, are examined, each with 20 randomizations.
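The comparison described above can be illustrated with a minimal NumPy sketch of the two kernel types and of repeated random train/test splits. This is not the authors' code: the function names, the RBF parametrization exp(-||a - b||^2 / sigma2), and the way a "randomization" is drawn here (a random permutation of the samples) are all assumptions made for illustration.

```python
import numpy as np

def linear_kernel(A, B):
    """Linear kernel matrix: K[i, j] = <a_i, b_j>."""
    return A @ B.T

def rbf_kernel(A, B, sigma2):
    """RBF kernel matrix: K[i, j] = exp(-||a_i - b_j||^2 / sigma2).

    The width convention (dividing by sigma2 rather than 2 * sigma2)
    is an assumption; other conventions differ only by a rescaling.
    """
    sq_dist = (
        (A ** 2).sum(axis=1)[:, None]
        + (B ** 2).sum(axis=1)[None, :]
        - 2.0 * (A @ B.T)
    )
    # Clamp tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dist, 0.0) / sigma2)

def random_splits(n_samples, n_train, n_repeats=20, seed=0):
    """Yield (train_idx, test_idx) pairs for repeated random splits.

    Hypothetical helper: each repeat permutes the sample indices and
    takes the first n_train as the training set.
    """
    rng = np.random.default_rng(seed)
    for _ in range(n_repeats):
        perm = rng.permutation(n_samples)
        yield perm[:n_train], perm[n_train:]
```

With expression data in a samples-by-genes matrix, each split would be used to fit and tune a model on the training indices only and to score it on the held-out test indices.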
Three main conclusions can be drawn from the performance on independent test sets. (1) When performing classification with least squares support vector machines (LS-SVMs) without dimensionality reduction, RBF kernels can be used without risking too much overfitting. In terms of test set receiver operating characteristic (ROC) performance and test set accuracy, the results obtained with well-tuned RBF kernels are never worse, and are sometimes statistically significantly better, than those obtained with a linear kernel. (2) Even for linear classifiers such as an LS-SVM with a linear kernel, regularization is very important. (3) When performing kernel principal component analysis (kernel PCA) before classification, using an RBF kernel for kernel PCA tends to result in overfitting, especially when using supervised feature selection. It has been observed that an optimal selection of a large number of features is often an indication of overfitting. Kernel PCA with a linear kernel gives better results.
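To make the role of the kernel and of regularization in conclusions (1) and (2) concrete, the following is a minimal sketch of an LS-SVM classifier in its standard dual form, where training reduces to solving one linear system. It is an illustration under assumptions, not the implementation used in the study: the function names are hypothetical, labels are assumed to be coded as -1/+1, and `gamma` denotes the regularization constant (larger values mean weaker regularization).

```python
import numpy as np

def lssvm_train(K, y, gamma):
    """Fit an LS-SVM classifier from a precomputed kernel matrix.

    Solves the dual linear system
        [ 0   y^T             ] [ b     ]   [ 0 ]
        [ y   Omega + I/gamma ] [ alpha ] = [ 1 ]
    with Omega[i, j] = y_i * y_j * K[i, j] and labels y in {-1, +1}.
    Returns the support values alpha and the bias b.
    """
    y = np.asarray(y, dtype=float)
    n = y.shape[0]
    omega = (y[:, None] * y[None, :]) * K
    system = np.zeros((n + 1, n + 1))
    system[0, 1:] = y
    system[1:, 0] = y
    system[1:, 1:] = omega + np.eye(n) / gamma  # 1/gamma acts as a ridge term
    rhs = np.concatenate(([0.0], np.ones(n)))
    solution = np.linalg.solve(system, rhs)
    return solution[1:], solution[0]

def lssvm_decision(K_test_train, y_train, alpha, b):
    """Latent output sum_i alpha_i * y_i * K(x, x_i) + b for each test point.

    The predicted class is the sign of this value; the raw value can be
    used to rank samples when computing an ROC curve.
    """
    y_train = np.asarray(y_train, dtype=float)
    return K_test_train @ (alpha * y_train) + b
```

Passing `K = X @ X.T` gives the linear-kernel classifier, while passing an RBF kernel matrix gives the non-linear one; in both cases `gamma` (and, for the RBF kernel, the kernel width) would be tuned on the training data only.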