How many samples are needed to build a classifier: a general sequential approach.

Authors

Fu Wenjiang J, Dougherty Edward R, Mallick Bani, Carroll Raymond J

Affiliation

Department of Statistics, Texas A&M University, 447 Blocker Building, College Station, TX 77843, USA.

Publication

Bioinformatics. 2005 Jan 1;21(1):63-70. doi: 10.1093/bioinformatics/bth461. Epub 2004 Aug 5.

Abstract

MOTIVATION

The standard paradigm for classifier design is to obtain a sample of feature-label pairs and then to apply a classification rule to derive a classifier from the sample data. Typically, in laboratory situations, the sample size is limited by cost, time or availability of sample material. Thus, an investigator may wish to consider a sequential approach that enrolls enough patients to train a classifier that supports a sound diagnostic decision, while keeping the number of patients as small as possible so that the study remains affordable.

RESULTS

A sequential classification procedure is studied via the martingale central limit theorem. It updates the classification rule at each step and provides stopping criteria to ensure with a certain confidence that at stopping a future subject will have misclassification probability smaller than a predetermined threshold. Simulation studies and applications to microarray data analysis are provided. The procedure possesses several attractive properties: (1) it updates the classification rule sequentially and thus does not rely on distributions of primary measurements from other studies; (2) it assesses the stopping criteria at each sequential step and thus can substantially reduce cost via early stopping; and (3) it is not restricted to any particular classification rule and therefore applies to any parametric or non-parametric method, including feature selection or extraction.
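
The sequential idea described above can be illustrated with a minimal Python sketch. It is not the authors' procedure or their R code: the nearest-centroid rule, the smoothed error estimate, the normal-approximation upper bound, and every name and default (`sequential_stop`, `threshold`, `z`, `burn_in`, `min_scored`, `max_n`) are assumptions chosen for illustration. Each incoming subject is first classified by the rule trained on all earlier subjects, then added to the training set; sampling stops once an approximate upper confidence bound on the running error rate falls below the threshold.

```python
import math
import random


def fit_centroids(xs, ys):
    """Stand-in classification rule: per-class mean of a 1-D feature."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict(centroids, x):
    """Assign x to the class whose centroid is closest."""
    return min(centroids, key=lambda y: abs(x - centroids[y]))


def sequential_stop(stream, threshold=0.2, z=1.645,
                    burn_in=10, min_scored=20, max_n=2000):
    """Return (n, p_hat, upper) at stopping, or None if max_n is reached.

    Hypothetical stopping rule (an assumption, not the paper's criterion):
    stop when p_hat + z * sqrt(p_hat * (1 - p_hat) / m) < threshold, where
    p_hat is a smoothed error rate over the m subjects scored so far.
    """
    xs, ys = [], []
    n_scored = n_wrong = 0
    for n, (x, y) in enumerate(stream, start=1):
        if n > burn_in:
            # Score the new subject with the rule trained on earlier ones only.
            centroids = fit_centroids(xs, ys)
            n_wrong += int(predict(centroids, x) != y)
            n_scored += 1
        # Update the training set, so the rule is refit at the next step.
        xs.append(x)
        ys.append(y)
        if n_scored >= min_scored:
            # Smoothing keeps the variance term positive when n_wrong == 0.
            p_hat = (n_wrong + 1) / (n_scored + 2)
            upper = p_hat + z * math.sqrt(p_hat * (1 - p_hat) / n_scored)
            if upper < threshold:
                return n, p_hat, upper
        if n >= max_n:
            return None
    return None


def simulated_stream(rng, sep=3.0):
    """Endless (feature, label) pairs from two unit-variance Gaussians."""
    while True:
        y = rng.randrange(2)
        yield rng.gauss(sep * y, 1.0), y


if __name__ == "__main__":
    print(sequential_stop(simulated_stream(random.Random(0))))
```

Because the classification rule enters only through `fit_centroids` and `predict`, any other rule could be substituted without changing the stopping logic, which mirrors property (3) above. Errors made while the rule is still poorly trained inflate the running estimate, so this simplified bound tends to be conservative.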

AVAILABILITY

R-code for the sequential stopping rule is available at http://stat.tamu.edu/~wfu/microarray/sequential/R-code.html
