How many samples are needed to build a classifier: a general sequential approach.

Authors

Fu Wenjiang J, Dougherty Edward R, Mallick Bani, Carroll Raymond J

Affiliation

Department of Statistics, Texas A&M University, 447 Blocker Building, College Station, TX 77843, USA.

Publication information

Bioinformatics. 2005 Jan 1;21(1):63-70. doi: 10.1093/bioinformatics/bth461. Epub 2004 Aug 5.

DOI:10.1093/bioinformatics/bth461

PMID:15297303

Abstract

MOTIVATION

The standard paradigm for a classifier design is to obtain a sample of feature-label pairs and then to apply a classification rule to derive a classifier from the sample data. Typically in laboratory situations the sample size is limited by cost, time or availability of sample material. Thus, an investigator may wish to consider a sequential approach in which there is a sufficient number of patients to train a classifier in order to make a sound decision for diagnosis while at the same time keeping the number of patients as small as possible to make the studies affordable.

RESULTS

A sequential classification procedure is studied via the martingale central limit theorem. It updates the classification rule at each step and provides stopping criteria to ensure with a certain confidence that at stopping a future subject will have misclassification probability smaller than a predetermined threshold. Simulation studies and applications to microarray data analysis are provided. The procedure possesses several attractive properties: (1) it updates the classification rule sequentially and thus does not rely on distributions of primary measurements from other studies; (2) it assesses the stopping criteria at each sequential step and thus can substantially reduce cost via early stopping; and (3) it is not restricted to any particular classification rule and therefore applies to any parametric or non-parametric method, including feature selection or extraction.
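The idea of retraining at each step and checking a stopping criterion can be illustrated with a minimal sketch. This is not the authors' R code or their martingale-based criterion, whose details the abstract does not give. It assumes a hypothetical one-dimensional two-class problem, a nearest-class-mean rule as the plug-in classifier, and a one-sided Wilson score bound (a normal approximation) as a stand-in for the confidence statement. Each new subject is first classified by the rule trained on all earlier subjects and only then added to the training set, so every recorded error is an out-of-sample error. All function names and default values (`threshold`, `burn_in`, `min_tests`, `max_n`) are illustrative assumptions.

```python
import math
import random
from statistics import NormalDist


def wilson_upper(errors, trials, confidence=0.95):
    """One-sided Wilson score upper bound on an error probability."""
    z = NormalDist().inv_cdf(confidence)
    p = errors / trials
    centre = p + z * z / (2 * trials)
    spread = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials ** 2))
    return (centre + spread) / (1 + z * z / trials)


def fit_nearest_mean(xs, ys):
    """Fit a 1-D nearest-class-mean rule and return a predict function."""
    g0 = [x for x, y in zip(xs, ys) if y == 0]
    g1 = [x for x, y in zip(xs, ys) if y == 1]
    if not g0 or not g1:
        only = 0 if g0 else 1  # only one class seen so far
        return lambda x: only
    m0, m1 = sum(g0) / len(g0), sum(g1) / len(g1)
    return lambda x: 0 if abs(x - m0) <= abs(x - m1) else 1


def make_sampler(seed):
    """Hypothetical data source: class 0 ~ N(0, 1), class 1 ~ N(3, 1)."""
    rng = random.Random(seed)

    def draw():
        y = rng.randint(0, 1)
        return rng.gauss(3.0 * y, 1.0), y

    return draw


def sequential_training(draw, fit, threshold=0.2, confidence=0.95,
                        burn_in=10, min_tests=20, max_n=1000):
    """Enroll subjects one at a time until the error bound drops below threshold."""
    xs, ys = [], []
    errors = tests = 0
    upper = 1.0
    for n in range(1, max_n + 1):
        x, y = draw()
        if n > burn_in:
            # Test the new subject with the rule trained on earlier subjects only.
            predict = fit(xs, ys)
            errors += int(predict(x) != y)
            tests += 1
        xs.append(x)
        ys.append(y)
        if tests >= min_tests:
            upper = wilson_upper(errors, tests, confidence)
            if upper < threshold:
                return {"stopped": True, "n": n, "upper": upper}
    return {"stopped": False, "n": max_n, "upper": upper}


result = sequential_training(make_sampler(0), fit_nearest_mean)
print(result)
```

Because `fit` is passed in as a function, any classification rule could be substituted, which mirrors property (3) above. The pooled error count mixes rules trained on different sample sizes, so the bound is only a rough guide to the error of the final rule.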

AVAILABILITY

R-code for the sequential stopping rule is available at http://stat.tamu.edu/~wfu/microarray/sequential/R-code.html


