Is cross-validation valid for small-sample microarray classification?

Authors

Braga-Neto Ulisses M, Dougherty Edward R

Affiliation

Section of Clinical Cancer Genetics, University of Texas MD Anderson Cancer Center, Houston, TX, USA.

Publication

Bioinformatics. 2004 Feb 12;20(3):374-80. doi: 10.1093/bioinformatics/btg419.

Abstract

MOTIVATION

Microarray classification typically possesses two striking attributes: (1) classifier design and error estimation are based on remarkably small samples and (2) cross-validation error estimation is employed in the majority of the papers. Thus, it is necessary to have a quantifiable understanding of the behavior of cross-validation in the context of very small samples.

RESULTS

An extensive simulation study has been performed comparing cross-validation, resubstitution and bootstrap estimation for three popular classification rules, namely linear discriminant analysis, 3-nearest-neighbor and decision trees (CART), using both synthetic and real breast-cancer patient data. Comparison is via the distribution of differences between the estimated and true errors. Various statistics for the deviation distribution have been computed: mean (for estimator bias), variance (for estimator precision), root-mean-square error (for composition of bias and variance) and quartile ranges, including outlier behavior. In general, while cross-validation error estimation is much less biased than resubstitution, it displays excessive variance, which makes individual estimates unreliable for small samples. Bootstrap methods provide improved performance relative to variance, but at a high computational cost and often with increased bias (albeit much less than with resubstitution).
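The deviation-distribution comparison described above can be illustrated with a small sketch. This is not the authors' simulation: it assumes a one-dimensional, two-class Gaussian model with unit variance, and it swaps in a simple nearest-mean rule in place of LDA, 3-nearest-neighbor or CART so that the true error can be computed exactly. The names `DELTA`, `draw_sample`, `deviation_stats` and the sample sizes are hypothetical choices made for illustration only.

```python
import random
from statistics import NormalDist, fmean, pvariance

DELTA = 1.0  # assumed distance between the two class means (unit variance)

def draw_sample(n_per_class, rng):
    """Draw a balanced two-class sample of (feature, label) pairs."""
    class0 = [(rng.gauss(0.0, 1.0), 0) for _ in range(n_per_class)]
    class1 = [(rng.gauss(DELTA, 1.0), 1) for _ in range(n_per_class)]
    return class0 + class1

def train(data):
    """Nearest-mean rule: the model is just the pair of class means."""
    m0 = fmean([x for x, y in data if y == 0])
    m1 = fmean([x for x, y in data if y == 1])
    return m0, m1

def predict(model, x):
    m0, m1 = model
    return 0 if abs(x - m0) <= abs(x - m1) else 1

def true_error(model):
    """Exact error of the trained rule under the assumed Gaussian model."""
    m0, m1 = model
    t = (m0 + m1) / 2.0  # decision threshold
    std = NormalDist()
    # Error when the rule labels x <= t as class 0 (the case m0 < m1).
    err = 0.5 * (1.0 - std.cdf(t)) + 0.5 * std.cdf(t - DELTA)
    return err if m0 < m1 else 1.0 - err  # labels are flipped if m0 > m1

def resubstitution(data):
    """Error of the rule on the same sample it was trained on."""
    model = train(data)
    return sum(predict(model, x) != y for x, y in data) / len(data)

def leave_one_out(data):
    """Leave-one-out cross-validation: retrain without each point in turn."""
    wrong = 0
    for i, (x, y) in enumerate(data):
        model = train(data[:i] + data[i + 1:])
        wrong += predict(model, x) != y
    return wrong / len(data)

def deviation_stats(estimator, n_per_class=5, trials=2000, seed=0):
    """Bias, variance and RMSE of (estimated error - true error)."""
    rng = random.Random(seed)
    deviations = []
    for _ in range(trials):
        data = draw_sample(n_per_class, rng)
        deviations.append(estimator(data) - true_error(train(data)))
    bias = fmean(deviations)
    variance = pvariance(deviations)
    rmse = (bias ** 2 + variance) ** 0.5
    return bias, variance, rmse

if __name__ == "__main__":
    for name, est in [("resubstitution", resubstitution),
                      ("leave-one-out", leave_one_out)]:
        bias, variance, rmse = deviation_stats(est)
        print(f"{name:>15}: bias={bias:+.3f} "
              f"variance={variance:.4f} rmse={rmse:.3f}")
```

Because both estimators are called with the same seed, they are evaluated on identical samples, so the printed bias, variance and RMSE are directly comparable. A bootstrap estimator could be passed to `deviation_stats` in the same way; it is omitted here to keep the sketch short.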
