Decorrelation of the true and estimated classifier errors in high-dimensional settings.

Authors

Hanczar Blaise, Hua Jianping, Dougherty Edward R

Affiliation

Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843, USA.

Publication information

EURASIP J Bioinform Syst Biol. 2007;2007(1):38473. doi: 10.1155/2007/38473.

Abstract

The aim of many microarray experiments is to build discriminatory diagnosis and prognosis models. Given the huge number of features and the small number of examples, model validity, which refers to the precision of error estimation, is a critical issue. Previous studies have addressed this issue via the deviation distribution (estimated error minus true error), in particular, the deterioration of cross-validation precision in high-dimensional settings where feature selection is used to mitigate the peaking phenomenon (overfitting). Because classifier design is based upon random samples, both the true and estimated errors are sample-dependent random variables, and one would expect a loss of precision if the estimated and true errors are not well correlated. Natural questions therefore arise as to the degree of correlation and the manner in which lack of correlation impacts error estimation. We demonstrate the effect of correlation on error precision via a decomposition of the variance of the deviation distribution, observe that the correlation is often severely decreased in high-dimensional settings, and show that the effect of high dimensionality on error estimation tends to result more from its decorrelating effects than from its impact on the variance of the estimated error. We consider the correlation between the true and estimated errors under different experimental conditions using both synthetic and real data, several feature-selection methods, different classification rules, and three commonly used error estimators (leave-one-out cross-validation, k-fold cross-validation, and .632 bootstrap). Moreover, three scenarios are considered: (1) feature selection, (2) known-feature set, and (3) all features. Only the first is of practical interest; however, the other two are needed for comparison purposes. We observe that the true and estimated errors tend to be much more correlated in the case of a known feature set than with either feature selection or using all features. Which of the latter two yields the better correlation shows no general trend and differs from model to model.
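
The variance decomposition mentioned in the abstract can be illustrated with a small simulation. The sketch below is not the authors' experimental setup: it assumes a two-class Gaussian model with identity covariance, a nearest-mean classifier, no feature selection, and leave-one-out cross-validation as the only estimator. All function names and parameter values are hypothetical choices made for illustration. It relies on the standard identity Var(est - true) = Var(est) + Var(true) - 2 * rho * sd(est) * sd(true), where rho is the correlation between the estimated and true errors.

```python
import math
import numpy as np

def sample(rng, n_per_class, d, delta):
    # Assumed model: class 0 ~ N(0, I), class 1 ~ N(delta * 1, I).
    X0 = rng.standard_normal((n_per_class, d))
    X1 = rng.standard_normal((n_per_class, d)) + delta
    return np.vstack([X0, X1]), np.repeat([0, 1], n_per_class)

def nearest_mean_fit(X, y):
    # Linear rule: predict class 1 when w @ x + b > 0.
    m0, m1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    w = m1 - m0
    return w, -0.5 * w @ (m0 + m1)

def phi(z):
    # Standard normal CDF.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def true_error(w, b, d, delta):
    # Exact error of the linear rule under the assumed model, equal priors.
    norm = np.linalg.norm(w)
    mu1 = np.full(d, delta)
    err0 = phi(b / norm)                  # class 0 points labelled 1
    err1 = phi(-(w @ mu1 + b) / norm)     # class 1 points labelled 0
    return 0.5 * (err0 + err1)

def loo_error(X, y):
    # Leave-one-out: refit without point i, then test on point i.
    n = len(y)
    mistakes = 0
    for i in range(n):
        keep = np.arange(n) != i
        w, b = nearest_mean_fit(X[keep], y[keep])
        mistakes += int((X[i] @ w + b > 0) != bool(y[i]))
    return mistakes / n

def simulate(n_trials=200, n_per_class=15, d=5, delta=0.6, seed=0):
    rng = np.random.default_rng(seed)
    true_err = np.empty(n_trials)
    est_err = np.empty(n_trials)
    for t in range(n_trials):
        X, y = sample(rng, n_per_class, d, delta)
        w, b = nearest_mean_fit(X, y)
        true_err[t] = true_error(w, b, d, delta)
        est_err[t] = loo_error(X, y)
    return true_err, est_err

def decompose(true_err, est_err):
    # Left side: variance of the deviation; right side: the decomposition.
    var_dev = np.var(est_err - true_err)
    s_est, s_true = np.std(est_err), np.std(true_err)
    rho = np.corrcoef(est_err, true_err)[0, 1]
    rhs = s_est**2 + s_true**2 - 2.0 * rho * s_est * s_true
    return var_dev, rhs, rho

if __name__ == "__main__":
    for d in (5, 50):
        var_dev, rhs, rho = decompose(*simulate(d=d))
        print(f"d={d}: Var(dev)={var_dev:.5f}, decomposition={rhs:.5f}, rho={rho:.3f}")
```

Because the identity is algebraic, the two sides agree up to floating-point error for any sample of error pairs. What the simulation adds is a way to see how rho and the individual variances shift as the assumed dimension `d` or sample size changes.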
