
The ties problem resulting from counting-based error estimators and its impact on gene selection algorithms.

Authors

Zhou Xin, Mao K Z

Affiliation

School of Electrical & Electronic Engineering, Nanyang Technological University, Nanyang Avenue, Singapore 639798, Singapore.

Publication

Bioinformatics. 2006 Oct 15;22(20):2507-15. doi: 10.1093/bioinformatics/btl438. Epub 2006 Aug 14.

Abstract

MOTIVATION

Feature selection approaches, such as filter and wrapper methods, have been applied to the gene selection problem in the microarray data analysis literature. In wrapper methods, the classification error is usually used as the evaluation criterion for feature subsets. Because microarray data are high-dimensional and have small sample sizes, however, counting-based error estimation may not be an ideal criterion for the gene selection problem.

RESULTS

Our study reveals that evaluating genes with counting-based error estimators such as resubstitution error, leave-one-out error, cross-validation error and bootstrap error may encounter a severe ties problem, i.e. two or more gene subsets score equally, which in turn results in uncertainty in gene selection. Our analysis finds that the ties problem is caused by the discrete nature of counting-based error estimators and could be avoided by using continuous evaluation criteria instead. Experimental results show that continuous evaluation criteria, such as the generalised |w|^2 measure for support vector machines and the modified Relief's measure for k-nearest neighbors, produce improved gene selection compared with counting-based error estimators.
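
The discreteness argument can be illustrated with a small simulation. The sketch below is not the authors' code and does not implement their generalised |w|^2 or modified Relief measures. It assumes synthetic Gaussian data, a simple nearest-class-mean classifier scored by leave-one-out error counts, and a hypothetical continuous stand-in criterion (`separation_score`, the gap between class means divided by the summed class spreads). With n samples, a leave-one-out error count can take at most n + 1 distinct values, so any pool of more than n + 1 candidate genes must contain ties.

```python
import random
import statistics

random.seed(0)

N_PER_CLASS = 10               # assumed sample size per class (small, as in microarray data)
N_SAMPLES = 2 * N_PER_CLASS
N_GENES = 200                  # assumed number of candidate genes

labels = [0] * N_PER_CLASS + [1] * N_PER_CLASS


def make_gene(shift):
    """Synthetic expression values; class 1 is shifted by `shift`."""
    return [random.gauss(shift * y, 1.0) for y in labels]


# The first 20 genes are informative, the rest are pure noise (an assumption).
genes = [make_gene(1.5 if g < 20 else 0.0) for g in range(N_GENES)]


def loo_error_count(values):
    """Leave-one-out error count of a nearest-class-mean classifier on one gene."""
    errors = 0
    for i in range(N_SAMPLES):
        means = {}
        for c in (0, 1):
            rest = [v for j, v in enumerate(values) if j != i and labels[j] == c]
            means[c] = statistics.mean(rest)
        predicted = min((0, 1), key=lambda c: abs(values[i] - means[c]))
        errors += predicted != labels[i]
    return errors


def separation_score(values):
    """Hypothetical continuous criterion: class-mean gap over summed class spreads."""
    class0 = [v for v, y in zip(values, labels) if y == 0]
    class1 = [v for v, y in zip(values, labels) if y == 1]
    gap = abs(statistics.mean(class0) - statistics.mean(class1))
    return gap / (statistics.pstdev(class0) + statistics.pstdev(class1))


error_counts = [loo_error_count(g) for g in genes]
scores = [separation_score(g) for g in genes]

best_error = min(error_counts)
tied_for_best = error_counts.count(best_error)

print("distinct LOO error values:", len(set(error_counts)), "of at most", N_SAMPLES + 1)
print("genes tied for the lowest LOO error:", tied_for_best)
print("distinct continuous scores:", len(set(scores)), "of", N_GENES)
```

Because the error count is an integer between 0 and n, ranking 200 genes by it necessarily groups many genes at the same score, whereas a real-valued criterion almost never produces exact ties. Which tied gene a wrapper picks then depends on arbitrary details such as evaluation order.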

AVAILABILITY

The companion website (http://www.ntu.edu.sg/home5/pg02776030/wrappers/) contains (1) the source code of all the gene selection algorithms and (2) the complete set of tables and figures from the experiments.
