The ties problem resulting from counting-based error estimators and its impact on gene selection algorithms.

Authors

Zhou Xin, Mao K Z

Affiliation

School of Electrical & Electronic Engineering, Nanyang Technological University, Nanyang Avenue, Singapore 639798, Singapore.

Publication details

Bioinformatics. 2006 Oct 15;22(20):2507-15. doi: 10.1093/bioinformatics/btl438. Epub 2006 Aug 14.

DOI:10.1093/bioinformatics/btl438

PMID:16908500

Abstract

MOTIVATION

Feature selection approaches, such as filter and wrapper methods, have been applied to the gene selection problem in the microarray data analysis literature. In wrapper methods, the classification error is usually used as the evaluation criterion for feature subsets. Because microarray data combine high dimensionality with small sample size, however, counting-based error estimation may not be an ideal criterion for the gene selection problem.

RESULTS

Our study reveals that evaluating genes in terms of counting-based error estimators, such as resubstitution error, leave-one-out error, cross-validation error and bootstrap error, may encounter a severe ties problem, i.e. two or more gene subsets score equally, which in turn results in uncertainty in gene selection. Our analysis finds that the ties problem is caused by the discrete nature of counting-based error estimators and could be avoided by using continuous evaluation criteria instead. Experimental results show that continuous evaluation criteria, such as the generalised |w|^2 measure for support vector machines and the modified Relief's measure for k-nearest neighbors, produce improved gene selection compared with counting-based error estimators.
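The ties problem described above can be illustrated with a minimal sketch. It is not the authors' code: the data are synthetic, the helper name loo_1nn_scores is hypothetical, and the continuous score is a simple Relief-style margin (distance to the nearest other-class sample minus distance to the nearest same-class sample) used as a stand-in for the modified Relief's measure, whose exact definition the abstract does not give. With n samples, a leave-one-out error count can take only n + 1 distinct values, so any pool of more than n + 1 candidate genes must contain ties.

```python
import random


def loo_1nn_scores(values, labels):
    """Score one gene with a 1-nearest-neighbour rule under leave-one-out.

    Returns (error_count, mean_margin). The margin is a Relief-style
    stand-in for a continuous criterion, not the paper's exact measure.
    """
    n = len(values)
    errors = 0
    margin_sum = 0.0
    for i in range(n):
        # Distance to the nearest sample of the same class (excluding i itself).
        hit = min(abs(values[i] - values[j])
                  for j in range(n) if j != i and labels[j] == labels[i])
        # Distance to the nearest sample of the other class.
        miss = min(abs(values[i] - values[j])
                   for j in range(n) if labels[j] != labels[i])
        if miss < hit:  # the nearest neighbour has the wrong label
            errors += 1
        margin_sum += miss - hit
    return errors, margin_sum / n


# Synthetic data (an assumption for illustration): 20 samples, 200 genes,
# each gene with a random class separation between 0 and 2 standard deviations.
rng = random.Random(0)
labels = [0] * 10 + [1] * 10
n_samples = len(labels)
n_genes = 200

error_counts, margins = [], []
for _ in range(n_genes):
    shift = rng.uniform(0.0, 2.0)
    values = [rng.gauss(shift * y, 1.0) for y in labels]
    errors, margin = loo_1nn_scores(values, labels)
    error_counts.append(errors)
    margins.append(margin)

best_error = min(error_counts)
tied_at_best = error_counts.count(best_error)

print("distinct LOO error values:", len(set(error_counts)), "of", n_genes, "genes")
print("genes tied at the best LOO error:", tied_at_best)
print("distinct margin values:", len(set(margins)), "of", n_genes, "genes")
```

In this sketch the error count cannot distinguish genes that misclassify the same number of samples, whereas the real-valued margin still ranks them. Whether the margin ranking is the better one is an empirical question that the paper addresses with its own criteria.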

AVAILABILITY

The companion website is at http://www.ntu.edu.sg/home5/pg02776030/wrappers/ and contains (1) the source code of all the gene selection algorithms and (2) the complete set of tables and figures from the experiments.

Similar articles

What should be expected from feature selection in small-sample settings.

Bioinformatics. 2006 Oct 1;22(19):2430-6. doi: 10.1093/bioinformatics/btl407. Epub 2006 Jul 26.

Classification based upon gene expression data: bias and precision of error rates.

Bioinformatics. 2007 Jun 1;23(11):1363-70. doi: 10.1093/bioinformatics/btm117. Epub 2007 Mar 28.

Prediction error estimation: a comparison of resampling methods.

Bioinformatics. 2005 Aug 1;21(15):3301-7. doi: 10.1093/bioinformatics/bti499. Epub 2005 May 19.

Genetic test bed for feature selection.

Bioinformatics. 2006 Apr 1;22(7):837-42. doi: 10.1093/bioinformatics/btl008. Epub 2006 Jan 20.

Optimal number of features as a function of sample size for various classification rules.

Bioinformatics. 2005 Apr 15;21(8):1509-15. doi: 10.1093/bioinformatics/bti171. Epub 2004 Nov 30.

Filter versus wrapper gene selection approaches in DNA microarray domains.

Artif Intell Med. 2004 Jun;31(2):91-103. doi: 10.1016/j.artmed.2004.01.007.

Reliable gene signatures for microarray classification: assessment of stability and performance.

Bioinformatics. 2006 Oct 1;22(19):2356-63. doi: 10.1093/bioinformatics/btl400. Epub 2006 Jul 31.

Gene selection based on multi-class support vector machines and genetic algorithms.

Genet Mol Res. 2005 Sep 30;4(3):599-607.

Gene selection in cancer classification using sparse logistic regression with Bayesian regularization.

Bioinformatics. 2006 Oct 1;22(19):2348-55. doi: 10.1093/bioinformatics/btl386. Epub 2006 Jul 14.

Cited by

Effective classification and gene expression profiling for the Facioscapulohumeral Muscular Dystrophy.

PLoS One. 2013 Dec 13;8(12):e82071. doi: 10.1371/journal.pone.0082071. eCollection 2013.

Analyzing kernel matrices for the identification of differentially expressed genes.

PLoS One. 2013 Dec 9;8(12):e81683. doi: 10.1371/journal.pone.0081683. eCollection 2013.

Predicting antitumor activity of peptides by consensus of regression models trained on a small data sample.

Int J Mol Sci. 2011;12(12):8415-30. doi: 10.3390/ijms12128415. Epub 2011 Nov 29.

Validation of computational methods in genomics.

Curr Genomics. 2007 Mar;8(1):1-19. doi: 10.2174/138920207780076956.

Decorrelation of the true and estimated classifier errors in high-dimensional settings.

EURASIP J Bioinform Syst Biol. 2007;2007(1):38473. doi: 10.1155/2007/38473.
