

Sample Size Requirements for Popular Classification Algorithms in Tabular Clinical Data: Empirical Study.

Authors

Silvey Scott, Liu Jinze

Affiliations

Department of Biostatistics, School of Public Health, Virginia Commonwealth University, Richmond, VA, United States.

Publication Information

J Med Internet Res. 2024 Dec 17;26:e60231. doi: 10.2196/60231.

Abstract

BACKGROUND

The performance of a classification algorithm eventually reaches a point of diminishing returns, where adding further samples no longer improves the results. Thus, there is a need to determine an optimal sample size that maximizes performance while accounting for computational burden or budgetary concerns.

OBJECTIVE

This study aimed to determine optimal sample sizes and the relationships between sample size and dataset-level characteristics over a variety of binary classification algorithms.

METHODS

A total of 16 large open-source datasets were collected, each containing a binary clinical outcome. Four machine learning algorithms were assessed: XGBoost (XGB), random forest (RF), logistic regression (LR), and neural networks (NNs). For each dataset, the cross-validated area under the curve (AUC) was calculated at increasing sample sizes, and learning curves were fit. The sample sizes needed to reach the observed full-dataset AUC minus 2 points (0.02) were calculated from the fitted learning curves and compared across datasets and algorithms. The following dataset-level characteristics were examined: minority class proportion, full-dataset AUC, number of features, type of features, and degree of nonlinearity. Negative binomial regression models were used to quantify the relationships between these characteristics and the expected sample size within each algorithm. A total of 4 multivariable models were constructed, each selecting the best-fitting combination of dataset-level characteristics.
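The abstract does not name the learning-curve model or software used; the sketch below illustrates the general procedure under common assumptions: cross-validated AUC computed at increasing subsample sizes with scikit-learn, an inverse power-law learning curve fit with SciPy, and the target sample size solved as the point where the fitted curve reaches the full-dataset AUC minus 0.02. The function names (cv_auc_at_sizes, n_for_target_auc) and model settings are illustrative, not taken from the paper.

```python
# Minimal sketch of the learning-curve procedure described above.
# Assumptions (not specified in the abstract): an inverse power-law
# curve AUC(n) = a - b * n**(-c), scikit-learn for cross-validated AUC,
# and SciPy for curve fitting. Names and settings are illustrative only.
import numpy as np
from scipy.optimize import curve_fit, brentq
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier


def cv_auc_at_sizes(X, y, sizes, seed=0):
    """Cross-validated AUC at increasing training subsample sizes."""
    rng = np.random.default_rng(seed)
    aucs = []
    for n in sizes:
        idx = rng.choice(len(y), size=int(n), replace=False)
        clf = XGBClassifier(n_estimators=200, eval_metric="logloss")
        aucs.append(cross_val_score(clf, X[idx], y[idx],
                                    cv=5, scoring="roc_auc").mean())
    return np.array(aucs)


def inverse_power_law(n, a, b, c):
    """Assumed learning-curve form: AUC approaches the asymptote a as n grows."""
    return a - b * np.power(n, -c)


def n_for_target_auc(sizes, aucs, target):
    """Fit the learning curve and solve for the sample size reaching `target` AUC."""
    params, _ = curve_fit(inverse_power_law, np.asarray(sizes, float), aucs,
                          p0=[aucs.max(), 1.0, 0.5], maxfev=10000)
    return brentq(lambda n: inverse_power_law(n, *params) - target,
                  float(min(sizes)), 10.0 * float(max(sizes)))


# Usage (X, y are the full dataset as NumPy arrays):
# sizes = np.geomspace(100, len(y), num=12, dtype=int)
# aucs = cv_auc_at_sizes(X, y, sizes)
# n_needed = n_for_target_auc(sizes, aucs, target=full_dataset_auc - 0.02)
```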

RESULTS

Among the 16 datasets (full-dataset sample sizes ranging from 70,000 to 1,000,000), the median sample sizes needed to reach AUC stability were 9960 (XGB), 3404 (RF), 696 (LR), and 12,298 (NN). For all 4 algorithms, more balanced classes (multiplier: 0.93-0.96 per 1% increase in minority class proportion) were associated with decreased sample size. Other characteristics varied in importance across algorithms; in general, more features, weaker features, and more complex relationships between the predictors and the response increased the expected sample size. In multivariable analysis, the top selected predictors were minority class proportion (all 4 algorithms), full-dataset AUC (XGB, RF, and NN), and dataset nonlinearity (XGB, RF, and NN). For LR, the top predictors were minority class proportion, percentage of strong linear features, and number of features. The final multivariable sample size models had high goodness of fit, with dataset-level predictors explaining a majority (66.5%-84.5%) of the total deviance across all 4 models.
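For readers unfamiliar with the multiplier interpretation: in a log-link negative binomial model the expected sample size is exp(b0 + b1 * x), so exp(b1) is the multiplicative change in expected sample size per 1-point increase in minority class proportion. The sketch below fits such a model with statsmodels on purely hypothetical numbers; the dataset-level values shown are illustrative, not results from the paper.

```python
# Minimal sketch of the negative binomial multiplier interpretation.
# The minority-class percentages and sample sizes below are hypothetical,
# used only to illustrate the model form; they are not the paper's data.
import numpy as np
import statsmodels.api as sm

minority_pct = np.array([5, 10, 15, 20, 30, 40, 50], dtype=float)
n_required = np.array([24000, 15000, 11000, 8000, 5000, 3500, 2600])

X = sm.add_constant(minority_pct)
fit = sm.GLM(n_required, X,
             family=sm.families.NegativeBinomial(alpha=1.0)).fit()

multiplier = np.exp(fit.params[1])  # expected change in n per 1-point increase
print(f"multiplier per 1% increase in minority class proportion: {multiplier:.2f}")

# Interpretation: with a multiplier of 0.95, a 10-point increase in the
# minority class proportion scales the expected sample size by
# 0.95 ** 10, roughly 0.60, i.e., about a 40% reduction.
```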

CONCLUSIONS

The sample sizes needed to reach AUC stability among 4 popular classification algorithms vary by dataset and method and are associated with dataset-level characteristics that can be influenced or estimated before the start of a research study.


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b36c/11688588/ee6deb130aa0/jmir_v26i1e60231_fig1.jpg
