Systematic benchmarking of microarray data classification: assessing the role of non-linearity and dimensionality reduction.

Author information

Pochet Nathalie, De Smet Frank, Suykens Johan A K, De Moor Bart L R

Affiliation

ESAT-SCD (SISTA), K.U. Leuven, Kasteelpark Arenberg 10, 3001 Leuven-Heverlee, Belgium.

Publication information

Bioinformatics. 2004 Nov 22;20(17):3185-95. doi: 10.1093/bioinformatics/bth383. Epub 2004 Jul 1.

DOI:10.1093/bioinformatics/bth383

PMID:15231531

Abstract

MOTIVATION

Microarrays are capable of determining the expression levels of thousands of genes simultaneously. In combination with classification methods, this technology can be useful to support clinical management decisions for individual patients, e.g. in oncology. The aim of this paper is to systematically benchmark the role of non-linear versus linear techniques and dimensionality reduction methods.

RESULTS

A systematic benchmarking study compares linear versions of standard classification and dimensionality reduction techniques with their non-linear counterparts, which use a radial basis function (RBF) kernel. In total, 9 binary cancer classification problems, derived from 7 publicly available microarray datasets, are examined, each with 20 randomizations.
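To make the linear versus RBF comparison concrete, the following is a minimal NumPy sketch of the two kernel functions and of kernel PCA on a precomputed kernel matrix. It is an illustration under stated assumptions, not the authors' implementation: the function names, the bandwidth parameter `sigma2`, and the convention K(x, z) = exp(-||x - z||^2 / sigma2) are choices made here, since the abstract does not specify them.

```python
import numpy as np

def linear_kernel(X, Z):
    """Linear kernel: plain inner products between rows of X and rows of Z."""
    return X @ Z.T

def rbf_kernel(X, Z, sigma2):
    """RBF kernel exp(-||x - z||^2 / sigma2); sigma2 is an assumed bandwidth convention."""
    sq_dist = (X**2).sum(axis=1)[:, None] + (Z**2).sum(axis=1)[None, :] - 2.0 * X @ Z.T
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dist, 0.0) / sigma2)

def kernel_pca_scores(K, n_components):
    """Project the training samples onto the leading kernel principal components."""
    n = K.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    Kc = J @ K @ J                            # kernel matrix centered in feature space
    eigvals, eigvecs = np.linalg.eigh(Kc)     # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    lam, V = eigvals[order], eigvecs[:, order]
    # Training-sample scores are eigenvectors scaled by sqrt(eigenvalue).
    return V * np.sqrt(np.maximum(lam, 0.0))

# Toy expression matrix: 6 samples x 4 genes (synthetic, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
scores_linear = kernel_pca_scores(linear_kernel(X, X), n_components=2)
scores_rbf = kernel_pca_scores(rbf_kernel(X, X, sigma2=4.0), n_components=2)
```

With a linear kernel, the scores coincide with ordinary PCA scores up to sign, which is why kernel PCA with a linear kernel can serve as the linear baseline in such a comparison.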

CONCLUSIONS

Three main conclusions can be formulated based on the performances on independent test sets. (1) When performing classification with least squares support vector machines (LS-SVMs) without dimensionality reduction, RBF kernels can be used without risking too much overfitting. The results obtained with well-tuned RBF kernels are never worse, and sometimes statistically significantly better, than results obtained with a linear kernel in terms of test set receiver operating characteristic (ROC) performance and test set accuracy. (2) Even for classification with linear classifiers such as LS-SVM with a linear kernel, using regularization is very important. (3) When performing kernel principal component analysis (kernel PCA) before classification, using an RBF kernel for kernel PCA tends to result in overfitting, especially when using supervised feature selection. It has been observed that an optimal selection of a large number of features is often an indication of overfitting. Kernel PCA with a linear kernel gives better results.
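The role of regularization in conclusion (2) can be illustrated with a small LS-SVM classifier sketch. It solves the standard LS-SVM linear system on a precomputed kernel matrix, with `gamma` as the regularization constant. The function names and the toy data are hypothetical, and the tuning procedure used in the paper is not reproduced here.

```python
import numpy as np

def lssvm_train(K, y, gamma):
    """Solve the LS-SVM classifier system for the dual weights alpha and the bias b.

    K     : (n, n) kernel matrix on the training samples
    y     : (n,) labels in {-1, +1}
    gamma : regularization constant (larger gamma means weaker regularization)
    """
    n = len(y)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = y
    A[1:, 0] = y
    A[1:, 1:] = np.outer(y, y) * K + np.eye(n) / gamma
    rhs = np.concatenate(([0.0], np.ones(n)))
    sol = np.linalg.solve(A, rhs)
    return sol[1:], sol[0]

def lssvm_predict(K_new, y_train, alpha, b):
    """Classify new samples; K_new holds kernel values between new and training samples."""
    return np.sign(K_new @ (alpha * y_train) + b)

# Toy data: two well-separated groups (synthetic, for illustration only).
X = np.array([[-2.0, -1.0], [-1.0, -2.0], [-2.0, -2.0],
              [2.0, 1.0], [1.0, 2.0], [2.0, 2.0]])
y = np.array([-1.0, -1.0, -1.0, 1.0, 1.0, 1.0])

K = X @ X.T                                  # linear kernel
alpha, b = lssvm_train(K, y, gamma=1.0)
train_pred = lssvm_predict(K, y, alpha, b)
```

The `np.eye(n) / gamma` term is where regularization enters: it keeps the system well conditioned even when the number of genes far exceeds the number of samples, which is the typical situation for microarray data.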
