
Is cross-validation valid for small-sample microarray classification?

Authors

Braga-Neto Ulisses M, Dougherty Edward R

Affiliation

Section of Clinical Cancer Genetics, University of Texas MD Anderson Cancer Center, Houston, TX, USA.

Publication

Bioinformatics. 2004 Feb 12;20(3):374-80. doi: 10.1093/bioinformatics/btg419.

DOI:10.1093/bioinformatics/btg419
PMID:14960464
Abstract

MOTIVATION

Microarray classification typically possesses two striking attributes: (1) classifier design and error estimation are based on remarkably small samples and (2) cross-validation error estimation is employed in the majority of the papers. Thus, it is necessary to have a quantifiable understanding of the behavior of cross-validation in the context of very small samples.

RESULTS

An extensive simulation study has been performed comparing cross-validation, resubstitution and bootstrap estimation for three popular classification rules (linear discriminant analysis, 3-nearest-neighbor and decision trees (CART)), using both synthetic and real breast-cancer patient data. Comparison is via the distribution of differences between the estimated and true errors. Various statistics for the deviation distribution have been computed: mean (for estimator bias), variance (for estimator precision), root-mean-square error (for composition of bias and variance) and quartile ranges, including outlier behavior. In general, while cross-validation error estimation is much less biased than resubstitution, it displays excessive variance, which makes individual estimates unreliable for small samples. Bootstrap methods provide improved performance relative to variance, but at a high computational cost and often with increased bias (albeit much less than with resubstitution).
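
The comparison described in the abstract can be illustrated with a small simulation. The block below is a minimal sketch and not the authors' protocol. It assumes a two-class Gaussian model with identity covariance, 10 samples per class, a hand-written linear discriminant, and a simple out-of-bag ("zero") bootstrap estimator, which may differ from the bootstrap variants evaluated in the study. All function names, the class means, the ridge term and the number of replicates are illustrative choices. For each simulated sample, the script records the deviation (estimated error minus true error) and then summarizes the deviation distribution by its mean (bias), variance and root-mean-square error.

```python
import math
import numpy as np

# Assumed synthetic model: two Gaussian classes with identity covariance.
MEANS = np.array([[0.0, 0.0], [1.0, 1.0]])

def fit_lda(X, y):
    """Fit a linear discriminant; returns (w, c), predicting class 1 when X @ w > c."""
    m0, m1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    centered = np.vstack([X[y == 0] - m0, X[y == 1] - m1])
    # Pooled covariance plus a tiny ridge (an assumption, for numerical safety).
    S = centered.T @ centered / (len(X) - 2) + 1e-6 * np.eye(X.shape[1])
    w = np.linalg.solve(S, m1 - m0)
    return w, w @ (m0 + m1) / 2.0

def error_rate(model, X, y):
    w, c = model
    return float(np.mean((X @ w > c).astype(int) != y))

def true_error(model):
    """Exact error of a linear rule under the assumed model with equal priors."""
    w, c = model
    phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    s = np.linalg.norm(w)  # std. dev. of w @ x when the covariance is the identity
    e0 = 1.0 - phi((c - w @ MEANS[0]) / s)  # class 0 points labeled as class 1
    e1 = phi((c - w @ MEANS[1]) / s)        # class 1 points labeled as class 0
    return 0.5 * (e0 + e1)

def resubstitution(X, y):
    """Error of the classifier on the same data used to design it."""
    return error_rate(fit_lda(X, y), X, y)

def leave_one_out(X, y):
    """Leave-one-out cross-validation error."""
    n = len(y)
    mistakes = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        mistakes += error_rate(fit_lda(X[keep], y[keep]), X[i:i + 1], y[i:i + 1])
    return mistakes / n

def bootstrap_zero(X, y, rng, B=25):
    """Out-of-bag bootstrap error, resampling within each class."""
    idx0, idx1 = np.flatnonzero(y == 0), np.flatnonzero(y == 1)
    wrong = total = 0
    for _ in range(B):
        boot = np.concatenate([rng.choice(idx0, len(idx0)),
                               rng.choice(idx1, len(idx1))])
        out = np.setdiff1d(np.arange(len(y)), boot)  # points left out of the resample
        if len(out) == 0:
            continue
        w, c = fit_lda(X[boot], y[boot])
        wrong += int(np.sum((X[out] @ w > c).astype(int) != y[out]))
        total += len(out)
    return wrong / total

def simulate(n_per_class=10, trials=200, seed=0):
    """Return bias, variance and RMSE of (estimate - true error) per estimator."""
    rng = np.random.default_rng(seed)
    y = np.repeat([0, 1], n_per_class)
    dev = {"resubstitution": [], "leave-one-out CV": [], "bootstrap zero": []}
    for _ in range(trials):
        X = MEANS[y] + rng.standard_normal((len(y), 2))
        truth = true_error(fit_lda(X, y))
        dev["resubstitution"].append(resubstitution(X, y) - truth)
        dev["leave-one-out CV"].append(leave_one_out(X, y) - truth)
        dev["bootstrap zero"].append(bootstrap_zero(X, y, rng) - truth)
    summary = {}
    for name, values in dev.items():
        d = np.array(values)
        summary[name] = {"bias": float(d.mean()),
                         "variance": float(d.var()),
                         "rmse": float(np.sqrt(np.mean(d ** 2)))}
    return summary

if __name__ == "__main__":
    for name, s in simulate().items():
        print(f"{name:18s} bias={s['bias']:+.3f} "
              f"var={s['variance']:.4f} rmse={s['rmse']:.3f}")
```

Because the variance here is the population variance of the deviations, the squared RMSE equals the squared bias plus the variance, which is the "composition of bias and variance" the abstract refers to. The numbers this toy setup produces depend on the assumed model and should not be read as a reproduction of the paper's results.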

Similar articles

1
Is cross-validation valid for small-sample microarray classification?
Bioinformatics. 2004 Feb 12;20(3):374-80. doi: 10.1093/bioinformatics/btg419.
2
Is cross-validation better than resubstitution for ranking genes?
Bioinformatics. 2004 Jan 22;20(2):253-8. doi: 10.1093/bioinformatics/btg399.
3
Estimating misclassification error with small samples via bootstrap cross-validation.
Bioinformatics. 2005 May 1;21(9):1979-86. doi: 10.1093/bioinformatics/bti294. Epub 2005 Feb 2.
4
Prediction error estimation: a comparison of resampling methods.
Bioinformatics. 2005 Aug 1;21(15):3301-7. doi: 10.1093/bioinformatics/bti499. Epub 2005 May 19.
5
Superior feature-set ranking for small samples using bolstered error estimation.
Bioinformatics. 2005 Apr 1;21(7):1046-54. doi: 10.1093/bioinformatics/bti081. Epub 2004 Oct 28.
6
Bias in error estimation when using cross-validation for model selection.
BMC Bioinformatics. 2006 Feb 23;7:91. doi: 10.1186/1471-2105-7-91.
7
Optimal number of features as a function of sample size for various classification rules.
Bioinformatics. 2005 Apr 15;21(8):1509-15. doi: 10.1093/bioinformatics/bti171. Epub 2004 Nov 30.
8
A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis.
Bioinformatics. 2005 Mar 1;21(5):631-43. doi: 10.1093/bioinformatics/bti033. Epub 2004 Sep 16.
9
Improved bolstering error estimation for gene ranking.
Annu Int Conf IEEE Eng Med Biol Soc. 2007;2007:4633-6. doi: 10.1109/IEMBS.2007.4353372.
10
What should be expected from feature selection in small-sample settings.
Bioinformatics. 2006 Oct 1;22(19):2430-6. doi: 10.1093/bioinformatics/btl407. Epub 2006 Jul 26.

Cited by

1
Cancer classification in high dimensional microarray gene expressions by feature selection using eagle prey optimization.
Front Genet. 2025 Mar 21;16:1528810. doi: 10.3389/fgene.2025.1528810. eCollection 2025.
2
An interpretable predictive deep learning platform for pediatric metabolic diseases.
J Am Med Inform Assoc. 2024 May 20;31(6):1227-1238. doi: 10.1093/jamia/ocae049.
3
A megastudy on the predictability of personal information from facial images: Disentangling demographic and non-demographic signals.
Sci Rep. 2023 Nov 29;13(1):21073. doi: 10.1038/s41598-023-42054-9.
4
Robust importance sampling for error estimation in the context of optimal Bayesian transfer learning.
Patterns (N Y). 2022 Jan 25;3(3):100428. doi: 10.1016/j.patter.2021.100428. eCollection 2022 Mar 11.
5
Predicting Working Memory in Healthy Older Adults Using Real-Life Language and Social Context Information: A Machine Learning Approach.
JMIR Aging. 2022 Mar 8;5(1):e28333. doi: 10.2196/28333.
6
Non-invasive monitoring of multiple wildlife health factors by fecal microbiome analysis.
Ecol Evol. 2022 Feb 9;12(2):e8564. doi: 10.1002/ece3.8564. eCollection 2022 Feb.
7
Simultaneous serotonin and dopamine monitoring across timescales by rapid pulse voltammetry with partial least squares regression.
Anal Bioanal Chem. 2021 Nov;413(27):6747-6767. doi: 10.1007/s00216-021-03665-1. Epub 2021 Oct 23.
8
Predicting Future Geographic Hotspots of Potentially Preventable Hospitalisations Using All Subset Model Selection and Repeated K-Fold Cross-Validation.
Int J Environ Res Public Health. 2021 Sep 29;18(19):10253. doi: 10.3390/ijerph181910253.
9
A novel feature selection algorithm based on damping oscillation theory.
PLoS One. 2021 Aug 6;16(8):e0255307. doi: 10.1371/journal.pone.0255307. eCollection 2021.
10
Machine Learning Protocols in Early Cancer Detection Based on Liquid Biopsy: A Survey.
Life (Basel). 2021 Jun 30;11(7):638. doi: 10.3390/life11070638.