The illusion of distribution-free small-sample classification in genomics.

Affiliation

Department of Electrical and Computer Engineering, Texas A&M University.

Publication information

Curr Genomics. 2011 Aug;12(5):333-41. doi: 10.2174/138920211796429763.

DOI:10.2174/138920211796429763

PMID:22294876

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC3145263/

Abstract

Classification has emerged as a major area of investigation in bioinformatics owing to the desire to discriminate phenotypes, in particular disease conditions, using high-throughput genomic data. While many classification rules have been proposed, there is a paucity of error estimation rules and an even greater paucity of theory concerning error estimation accuracy. This is problematic because the worth of a classifier depends mainly on its error rate. It is commonplace in bioinformatics papers for a classification rule to be applied to a small labeled data set and for the error of the resulting classifier to be estimated on the same data set, most often via cross-validation, without any assumptions being made about the underlying feature-label distribution. Concomitant with a lack of distributional assumptions is the absence of any statement regarding the accuracy of the error estimate. Without such a measure of accuracy, the most common one being the root-mean-square (RMS) error, the error estimate is essentially meaningless and the worth of the entire paper is questionable. The concomitance of an absence of distributional assumptions and of a measure of error estimation accuracy is assured in small-sample settings because even when distribution-free bounds exist (and that is rare), the sample sizes required under the bounds are so large as to make them useless for small samples. Thus, distributional bounds are necessary and the distributional assumptions need to be stated. Owing to the epistemological dependence of classifiers on the accuracy of their estimated errors, scientifically meaningful distribution-free classification in high-throughput, small-sample biology is an illusion.
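To make the role of the RMS concrete, the following is a minimal sketch, not taken from the paper, of how the RMS of a cross-validation error estimate can be approximated by simulation once a feature-label distribution is assumed. Everything specific here is an assumption chosen for illustration: a one-dimensional model with classes N(0,1) and N(delta,1) and equal priors, a nearest-mean classification rule, leave-one-out cross-validation, and the hypothetical function names `train`, `true_error`, `loo_error`, and `rms_of_loo`. The RMS is taken as the square root of the mean squared difference between the estimated and true errors over repeated samples.

```python
import math
import random
from statistics import NormalDist, fmean

PHI = NormalDist().cdf  # standard normal CDF


def train(x0, x1):
    """Nearest-mean rule in 1-D: return (threshold, sign)."""
    m0, m1 = fmean(x0), fmean(x1)
    # sign = +1 means "predict class 1 when x > threshold".
    return (m0 + m1) / 2.0, (1 if m1 >= m0 else -1)


def predict(model, x):
    t, sign = model
    return 1 if sign * (x - t) > 0 else 0


def true_error(model, delta):
    """Exact error under the ASSUMED model N(0,1) vs N(delta,1), equal priors."""
    t, sign = model
    e0 = 1.0 - PHI(t)    # class 0 mass above the threshold
    e1 = PHI(t - delta)  # class 1 mass below the threshold
    err = 0.5 * (e0 + e1)
    # If the sample means came out reversed, the decision is flipped.
    return err if sign == 1 else 1.0 - err


def loo_error(x0, x1):
    """Leave-one-out cross-validation estimate on the training sample itself."""
    wrong = 0
    for i in range(len(x0)):
        model = train(x0[:i] + x0[i + 1:], x1)
        wrong += predict(model, x0[i]) != 0
    for i in range(len(x1)):
        model = train(x0, x1[:i] + x1[i + 1:])
        wrong += predict(model, x1[i]) != 1
    return wrong / (len(x0) + len(x1))


def rms_of_loo(n_per_class, delta=1.0, reps=2000, seed=0):
    """Monte Carlo RMS of (estimated - true) error; needs n_per_class >= 2."""
    rng = random.Random(seed)
    squared = []
    for _ in range(reps):
        x0 = [rng.gauss(0.0, 1.0) for _ in range(n_per_class)]
        x1 = [rng.gauss(delta, 1.0) for _ in range(n_per_class)]
        estimated = loo_error(x0, x1)
        actual = true_error(train(x0, x1), delta)
        squared.append((estimated - actual) ** 2)
    return math.sqrt(fmean(squared))


if __name__ == "__main__":
    for n in (5, 10, 50):
        print(n, round(rms_of_loo(n), 3))
```

The true error inside `rms_of_loo` can be computed only because the distribution was stated. Without a model there is nothing to compare the cross-validation estimate against, which is the abstract's point about distribution-free error estimation in small samples.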
