Decorrelation of the true and estimated classifier errors in high-dimensional settings.

Authors

Hanczar Blaise, Hua Jianping, Dougherty Edward R

Affiliation

Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843, USA.

Publication information

EURASIP J Bioinform Syst Biol. 2007;2007(1):38473. doi: 10.1155/2007/38473.

DOI: 10.1155/2007/38473

PMID: 18288255

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3171336/

Abstract

The aim of many microarray experiments is to build discriminatory diagnosis and prognosis models. Given the huge number of features and the small number of examples, model validity, which refers to the precision of error estimation, is a critical issue. Previous studies have addressed this issue via the deviation distribution (estimated error minus true error), in particular the deterioration of cross-validation precision in high-dimensional settings where feature selection is used to mitigate the peaking phenomenon (overfitting). Because classifier design is based upon random samples, both the true and estimated errors are sample-dependent random variables, and one would expect a loss of precision if the estimated and true errors are not well correlated. Natural questions therefore arise as to the degree of correlation and the manner in which lack of correlation impacts error estimation. We demonstrate the effect of correlation on error precision via a decomposition of the variance of the deviation distribution, observe that the correlation is often severely decreased in high-dimensional settings, and show that the effect of high dimensionality on error estimation tends to result more from its decorrelating effects than from its impact on the variance of the estimated error. We consider the correlation between the true and estimated errors under different experimental conditions using both synthetic and real data, several feature-selection methods, different classification rules, and three commonly used error estimators (leave-one-out cross-validation, k-fold cross-validation, and .632 bootstrap). Moreover, three scenarios are considered: (1) feature selection, (2) known feature set, and (3) all features. Only the first is of practical interest; however, the other two are needed for comparison purposes. We observe that the true and estimated errors tend to be much more correlated in the case of a known feature set than with either feature selection or using all features; which of the latter two yields the better correlation shows no general trend and differs from model to model.
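
The abstract does not spell out its variance decomposition. A minimal sketch, assuming it refers to the standard identity Var(est − true) = Var(est) + Var(true) − 2ρ·σ_est·σ_true, shows how a drop in the correlation ρ alone widens the deviation distribution even when the variance of the estimated error is unchanged. The function names and all numeric values below are hypothetical and chosen only for illustration; none of them come from the paper.

```python
import math
import random


def deviation_variance(var_est, var_true, rho):
    """Variance of (estimated error - true error) from the standard identity
    Var(X - Y) = Var(X) + Var(Y) - 2 * rho * sd(X) * sd(Y)."""
    return var_est + var_true - 2.0 * rho * math.sqrt(var_est * var_true)


def sample_moments(xs, ys):
    """Return (var_x, var_y, correlation) using population (1/n) formulas."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    var_x = sum((x - mx) ** 2 for x in xs) / n
    var_y = sum((y - my) ** 2 for y in ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    return var_x, var_y, cov / math.sqrt(var_x * var_y)


# Hypothetical variances (not taken from the paper). The estimated error has
# the same variance in both cases; only the correlation rho changes.
var_est, var_true = 0.010, 0.004
well_correlated = deviation_variance(var_est, var_true, rho=0.8)
decorrelated = deviation_variance(var_est, var_true, rho=0.1)

# Sanity check of the identity on synthetic paired "errors" (illustrative
# Gaussian draws, not a model of any real classifier or error estimator).
rng = random.Random(0)
true_err = [rng.gauss(0.25, 0.05) for _ in range(2000)]
est_err = [t + rng.gauss(0.0, 0.08) for t in true_err]
vx, vy, r = sample_moments(est_err, true_err)
dev = [e - t for e, t in zip(est_err, true_err)]
mean_dev = sum(dev) / len(dev)
var_dev = sum((d - mean_dev) ** 2 for d in dev) / len(dev)

print(f"deviation variance at rho=0.8: {well_correlated:.6f}")
print(f"deviation variance at rho=0.1: {decorrelated:.6f}")
print(f"directly computed: {var_dev:.6f}, via identity: "
      f"{deviation_variance(vx, vy, r):.6f}")
```

With the two variances held fixed, the only term that changes between the two cases is the one containing ρ, which is the sense in which decorrelation, rather than the variance of the estimated error, can dominate the loss of precision.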
