The illusion of distribution-free small-sample classification in genomics.

Affiliation

Department of Electrical and Computer Engineering, Texas A&M University.

Publication information

Curr Genomics. 2011 Aug;12(5):333-41. doi: 10.2174/138920211796429763.

Abstract

Classification has emerged as a major area of investigation in bioinformatics owing to the desire to discriminate phenotypes, in particular disease conditions, using high-throughput genomic data. While many classification rules have been proposed, there is a paucity of error estimation rules and an even greater paucity of theory concerning error estimation accuracy. This is problematic because the worth of a classifier depends mainly on its error rate. It is commonplace in bioinformatics papers to apply a classification rule to a small labeled data set and to estimate the error of the resulting classifier on the same data set, most often via cross-validation, without making any assumptions about the underlying feature-label distribution. Concomitant with a lack of distributional assumptions is the absence of any statement regarding the accuracy of the error estimate. Without such a measure of accuracy, the most common one being the root-mean-square (RMS) error, the error estimate is essentially meaningless and the worth of the entire paper is questionable. The concomitance of an absence of distributional assumptions and an absence of a measure of error estimation accuracy is assured in small-sample settings because even when distribution-free bounds exist (and that is rare), the sample sizes required under the bounds are so large as to make them useless for small samples. Thus, distributional bounds are necessary and the distributional assumptions need to be stated. Owing to the epistemological dependence of classifiers on the accuracy of their estimated errors, scientifically meaningful distribution-free classification in high-throughput, small-sample biology is an illusion.
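To make the RMS notion concrete, the following is a minimal simulation sketch and not the authors' procedure. It assumes a specific feature-label distribution (two equally likely Gaussian classes with identity covariance), a nearest-mean classification rule, and leave-one-out cross-validation; the function names and all parameter values (`n_per_class`, `dim`, `delta`, `trials`) are hypothetical choices made for illustration. Because the distribution is assumed known, the true error of each designed classifier can be computed in closed form, which is what allows the RMS of the cross-validation estimate to be measured at all.

```python
import math
import numpy as np

def normal_cdf(z):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def train_nearest_mean(X, y):
    # Nearest-mean rule written as a linear classifier: predict 1 if w.x + b > 0.
    m0 = X[y == 0].mean(axis=0)
    m1 = X[y == 1].mean(axis=0)
    w = m1 - m0
    b = -0.5 * float(w @ (m0 + m1))
    return w, b

def true_error(w, b, mu0, mu1):
    # Exact error of a linear classifier under the ASSUMED model:
    # class k ~ N(mu_k, I), equal priors.
    norm = float(np.linalg.norm(w))
    if norm == 0.0:
        return 0.5
    e0 = normal_cdf((float(w @ mu0) + b) / norm)    # class 0 labeled as 1
    e1 = normal_cdf(-(float(w @ mu1) + b) / norm)   # class 1 labeled as 0
    return 0.5 * (e0 + e1)

def loo_error(X, y):
    # Leave-one-out cross-validation estimate of the error.
    n = len(y)
    mistakes = 0
    for i in range(n):
        keep = np.arange(n) != i
        w, b = train_nearest_mean(X[keep], y[keep])
        pred = int(float(w @ X[i]) + b > 0)
        mistakes += int(pred != y[i])
    return mistakes / n

def rms_of_loo(n_per_class=10, dim=5, delta=1.0, trials=500, seed=0):
    # RMS deviation between the LOO estimate and the true error,
    # averaged over many small samples drawn from the assumed model.
    rng = np.random.default_rng(seed)
    mu0 = np.zeros(dim)
    mu1 = np.zeros(dim)
    mu1[0] = delta  # classes differ only in the first feature (assumption)
    y = np.repeat([0, 1], n_per_class)
    sq_dev = []
    for _ in range(trials):
        X = np.vstack([
            rng.normal(mu0, 1.0, (n_per_class, dim)),
            rng.normal(mu1, 1.0, (n_per_class, dim)),
        ])
        w, b = train_nearest_mean(X, y)
        sq_dev.append((loo_error(X, y) - true_error(w, b, mu0, mu1)) ** 2)
    return math.sqrt(float(np.mean(sq_dev)))

if __name__ == "__main__":
    print("RMS of leave-one-out estimate:", rms_of_loo())
```

The RMS printed here is meaningful only relative to the assumed Gaussian model; with real data and no stated distribution, the `true_error` term is unavailable, which is the situation the abstract criticizes.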

