Is cross-validation better than resubstitution for ranking genes?

Authors

Braga-Neto Ulisses, Hashimoto Ronaldo, Dougherty Edward R, Nguyen Danh V, Carroll Raymond J

Affiliation

Section of Clinical Cancer Genetics, University of Texas M. D. Anderson Cancer Center, Houston, TX, USA.

Publication information

Bioinformatics. 2004 Jan 22;20(2):253-8. doi: 10.1093/bioinformatics/btg399.

DOI:10.1093/bioinformatics/btg399
PMID:14734317
Abstract

MOTIVATION

Ranking gene feature sets is a key issue for both phenotype classification, for instance, tumor classification in a DNA microarray experiment, and prediction in the context of genetic regulatory networks. Two broad methods are available to estimate the error (misclassification rate) of a classifier. Resubstitution fits a single classifier to the data, and applies this classifier in turn to each data observation. Cross-validation (in leave-one-out form) removes each observation in turn, constructs the classifier, and then computes whether this leave-one-out classifier correctly classifies the deleted observation. Resubstitution typically underestimates classifier error, severely so in many cases. Cross-validation has the advantage of producing an effectively unbiased error estimate, but the estimate is highly variable. In many applications it is not the misclassification rate per se that is of interest, but rather the construction of gene sets that have the potential to classify or predict. Hence, one needs to rank feature sets based on their performance.
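
The two estimators described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names, the toy two-class Gaussian sample, and the choice of a plain k-nearest-neighbor rule are all assumptions made for the example.

```python
import random

def knn_predict(train, query, k=3):
    """Majority vote among the k nearest training points (squared Euclidean distance)."""
    nearest = sorted(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], query)),
    )[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > len(nearest) else 0

def resubstitution_error(data, k=3):
    """Build one classifier on the full sample and apply it to every observation."""
    wrong = sum(knn_predict(data, x, k) != y for x, y in data)
    return wrong / len(data)

def leave_one_out_error(data, k=3):
    """Remove each observation in turn, rebuild the classifier, classify the removed point."""
    wrong = 0
    for i, (x, y) in enumerate(data):
        held_out = data[:i] + data[i + 1:]
        wrong += knn_predict(held_out, x, k) != y
    return wrong / len(data)

def gaussian_sample(n_per_class, shift, dim, rng):
    """Hypothetical toy model: two unit-variance Gaussian classes with means 0 and `shift`."""
    data = []
    for label in (0, 1):
        for _ in range(n_per_class):
            x = tuple(rng.gauss(label * shift, 1.0) for _ in range(dim))
            data.append((x, label))
    return data

rng = random.Random(0)
sample = gaussian_sample(n_per_class=10, shift=1.0, dim=2, rng=rng)
print("resubstitution:", resubstitution_error(sample))
print("leave-one-out: ", leave_one_out_error(sample))
```

Resubstitution classifies each point with a classifier that has already seen it, which is why it tends to be optimistic; leave-one-out rebuilds the classifier once per observation, which is where the extra computation comes from.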

RESULTS

A model-based approach is used to compare the ranking performances of resubstitution and cross-validation for classification based on real-valued feature sets and for prediction in the context of probabilistic Boolean networks (PBNs). For classification, a Gaussian model is considered, along with classification via linear discriminant analysis and the 3-nearest-neighbor classification rule. Prediction is examined in the steady-distribution of a PBN. Three metrics are proposed to compare feature-set ranking based on error estimation with ranking based on the true error, which is known owing to the model-based approach. In all cases, resubstitution is competitive with cross-validation relative to ranking accuracy. This is in addition to the enormous savings in computation time afforded by resubstitution.
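
Comparing a ranking based on estimated error with the ranking based on true error can be illustrated with a simple overlap count. The abstract does not define the three metrics, so the measure below (how many of the truly best K feature sets also appear in the estimated top K) is a hypothetical stand-in, and the feature-set names and error values are made up for illustration.

```python
def rank_by_error(errors):
    """Return feature-set names ordered from lowest to highest error (ties broken by name)."""
    return sorted(errors, key=lambda name: (errors[name], name))

def top_k_overlap(true_errors, estimated_errors, k):
    """Count feature sets that are in the top k of both the true and the estimated ranking."""
    true_top = set(rank_by_error(true_errors)[:k])
    estimated_top = set(rank_by_error(estimated_errors)[:k])
    return len(true_top & estimated_top)

# Made-up numbers: the estimates are optimistic but only partly reorder the feature sets.
true_errors = {"g1+g2": 0.10, "g1+g3": 0.15, "g2+g3": 0.30, "g3+g4": 0.35}
estimated_errors = {"g1+g2": 0.02, "g1+g3": 0.05, "g2+g3": 0.04, "g3+g4": 0.20}

print(top_k_overlap(true_errors, estimated_errors, k=2))
```

An estimator can be badly biased in absolute terms and still rank well, because ranking depends only on the order of the estimates, not on their values.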

Similar articles

1
Is cross-validation better than resubstitution for ranking genes?
Bioinformatics. 2004 Jan 22;20(2):253-8. doi: 10.1093/bioinformatics/btg399.
2
Superior feature-set ranking for small samples using bolstered error estimation.
Bioinformatics. 2005 Apr 1;21(7):1046-54. doi: 10.1093/bioinformatics/bti081. Epub 2004 Oct 28.
3
Is cross-validation valid for small-sample microarray classification?
Bioinformatics. 2004 Feb 12;20(3):374-80. doi: 10.1093/bioinformatics/btg419.
4
Bias in error estimation when using cross-validation for model selection.
BMC Bioinformatics. 2006 Feb 23;7:91. doi: 10.1186/1471-2105-7-91.
5
Feature selection and nearest centroid classification for protein mass spectrometry.
BMC Bioinformatics. 2005 Mar 23;6:68. doi: 10.1186/1471-2105-6-68.
6
Prediction error estimation: a comparison of resampling methods.
Bioinformatics. 2005 Aug 1;21(15):3301-7. doi: 10.1093/bioinformatics/bti499. Epub 2005 May 19.
7
Optimal number of features as a function of sample size for various classification rules.
Bioinformatics. 2005 Apr 15;21(8):1509-15. doi: 10.1093/bioinformatics/bti171. Epub 2004 Nov 30.
8
Corrected small-sample estimation of the Bayes error.
Bioinformatics. 2003 May 22;19(8):944-51. doi: 10.1093/bioinformatics/btg144.
9
The ties problem resulting from counting-based error estimators and its impact on gene selection algorithms.
Bioinformatics. 2006 Oct 15;22(20):2507-15. doi: 10.1093/bioinformatics/btl438. Epub 2006 Aug 14.
10
Instance-based concept learning from multiclass DNA microarray data.
BMC Bioinformatics. 2006 Feb 16;7:73. doi: 10.1186/1471-2105-7-73.

Cited by

1
Nanoparticle Skin Penetration: Depths and Routes Modeled In-Silico.
Small. 2025 May;21(20):e2412541. doi: 10.1002/smll.202412541. Epub 2025 Mar 27.
2
Machine Learning Based Computational Gene Selection Models: A Survey, Performance Evaluation, Open Issues, and Future Research Directions.
Front Genet. 2020 Dec 10;11:603808. doi: 10.3389/fgene.2020.603808. eCollection 2020.
3
Dual-specificity phosphatase (DUSP) genetic variants predict pulmonary hypertension in patients with bronchopulmonary dysplasia.
Pediatr Res. 2020 Jan;87(1):81-87. doi: 10.1038/s41390-019-0502-9. Epub 2019 Jul 22.
4
Objective detection of chronic stress using physiological parameters.
Med Biol Eng Comput. 2018 Dec;56(12):2273-2286. doi: 10.1007/s11517-018-1854-8. Epub 2018 Jun 18.
5
RAPIDSNPs: A new computational pipeline for rapidly identifying key genetic variants reveals previously unidentified SNPs that are significantly associated with individual platelet responses.
PLoS One. 2017 Apr 25;12(4):e0175957. doi: 10.1371/journal.pone.0175957. eCollection 2017.
6
SNP by SNP by environment interaction network of alcoholism.
BMC Syst Biol. 2017 Mar 14;11(Suppl 3):19. doi: 10.1186/s12918-017-0403-7.
7
Unbiased bootstrap error estimation for linear discriminant analysis.
EURASIP J Bioinform Syst Biol. 2014 Oct 3;2014:15. doi: 10.1186/s13637-014-0015-0. eCollection 2014 Dec.
8
RiGoR: reporting guidelines to address common sources of bias in risk model development.
Biomark Res. 2015 Jan 24;3(1):2. doi: 10.1186/s40364-014-0027-7. eCollection 2015.
9
Radiomics: the process and the challenges.
Magn Reson Imaging. 2012 Nov;30(9):1234-48. doi: 10.1016/j.mri.2012.06.010. Epub 2012 Aug 13.
10
Wrapper-based selection of genetic features in genome-wide association studies through fast matrix operations.
Algorithms Mol Biol. 2012 May 2;7(1):11. doi: 10.1186/1748-7188-7-11.