Dep. Ing. Eléctrica, Facultad de Ingeniería, Universidad de Concepción, Concepción, Chile.
BMC Med Inform Decis Mak. 2012 Feb 15;12:8. doi: 10.1186/1472-6947-12-8.
Supervised learning methods need annotated data in order to generate efficient models. Annotated data, however, is a relatively scarce resource and can be expensive to obtain. For both passive and active learning methods, there is a need to estimate the size of the annotated sample required to reach a performance target.
We designed and implemented a method that fits an inverse power law model to the points of a learning curve generated from a small annotated training set. Fitting is carried out using nonlinear weighted least squares optimization. The fitted model is then used to predict the classifier's performance, with a confidence interval, at larger sample sizes. For evaluation, the nonlinear weighted curve fitting method was applied to a set of learning curves generated from clinical text and waveform classification tasks with active and passive sampling methods, and predictions were validated using standard goodness-of-fit measures. As a control, we used an un-weighted fitting method.
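The procedure above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes an inverse power law of the form y(x) = (1 - a) - b·x^c (x = training set size, y = classifier performance), and the specific weighting scheme (weights proportional to sample size, so later curve points count more) is an assumption; the learning-curve points are hypothetical.

```python
import numpy as np
from scipy.optimize import curve_fit

def inv_power_law(x, a, b, c):
    # Performance approaches the asymptote (1 - a) as x grows;
    # b and c (c < 0 for an increasing curve) set the approach rate.
    return (1.0 - a) - b * np.power(x, c)

# Hypothetical observed learning-curve points: (training size, accuracy)
sizes = np.array([10.0, 20.0, 40.0, 80.0, 160.0])
accs = np.array([0.62, 0.70, 0.76, 0.80, 0.83])

# Weighted nonlinear least squares: curve_fit takes per-point sigma,
# and sigma ~ 1/sqrt(weight), so larger weights shrink the residual
# penalty less -- points from bigger training sets dominate the fit.
weights = sizes / sizes.sum()
params, pcov = curve_fit(
    inv_power_law, sizes, accs,
    p0=(0.1, 1.0, -0.5),          # rough initial guess
    sigma=1.0 / np.sqrt(weights),
    maxfev=10000,
)
a, b, c = params

# Extrapolate: predicted performance at a larger annotation budget.
pred_1000 = inv_power_law(1000.0, a, b, c)
# The parameter covariance pcov can be propagated through the model
# to obtain a confidence interval for this prediction.
```

The fitted asymptote `1 - a` gives the estimated maximum achievable performance, and inverting the fitted curve answers the practical question of how many annotated samples are needed to reach a target score.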
A total of 568 models were fitted and the model predictions were compared with the observed performances. Depending on the data set and sampling method, it took between 80 and 560 annotated samples to achieve mean absolute and root mean squared errors below 0.01. Results also show that our weighted fitting method outperformed the baseline un-weighted method (p < 0.05).
This paper describes a simple and effective sample size prediction algorithm that conducts weighted fitting of learning curves. The algorithm outperformed an un-weighted algorithm described in previous literature. It can help researchers determine annotation sample size for supervised machine learning.