Li Jing
Department of Political Science, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America.
PLoS One. 2024 Dec 23;19(12):e0316019. doi: 10.1371/journal.pone.0316019. eCollection 2024.
The proper use of model evaluation metrics is important for model evaluation and model selection in binary classification tasks. This study investigates how consistently different metrics evaluate models across data of varying prevalence while the relationships among variables and the sample size are held constant. Analyzing 156 data scenarios, 18 model evaluation metrics, and five commonly used machine learning models along with a naive random-guess model, I find that evaluation metrics that are less influenced by prevalence evaluate individual models more consistently and rank a set of models more consistently. In particular, the Area Under the ROC Curve (AUC), which takes all decision thresholds into account when evaluating models, has the smallest variance both in evaluating individual models and in ranking a set of models. A close threshold analysis using all possible thresholds for all metrics further supports the hypothesis that considering all decision thresholds helps reduce the variance of model evaluation with respect to prevalence changes in the data. The results have significant implications for model evaluation and model selection in binary classification tasks.
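As a minimal sketch of the kind of comparison the abstract describes (not the study's actual code or data), the following Python snippet trains one classifier and then scores it on test sets that differ only in positive-class prevalence, contrasting AUC, which considers all decision thresholds, with F1 at the default 0.5 cutoff, a threshold-dependent metric. The simulation setup (scikit-learn's make_classification, logistic regression, the chosen prevalence values) is an illustrative assumption, not taken from the paper.

# Illustrative sketch: how AUC vs. a threshold-dependent metric (F1)
# shifts when only the test-set prevalence changes. Assumes scikit-learn.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, f1_score

def evaluate_at_prevalence(model, prevalence, n=5000):
    """Draw a test set with the given positive-class prevalence,
    keeping the data-generating process otherwise fixed (same seed)."""
    X, y = make_classification(
        n_samples=n, n_features=10,
        weights=[1 - prevalence, prevalence],  # controls class prevalence
        random_state=0,
    )
    proba = model.predict_proba(X)[:, 1]
    auc = roc_auc_score(y, proba)                      # all thresholds
    f1 = f1_score(y, (proba >= 0.5).astype(int))       # single threshold
    return auc, f1

# Train once on balanced data, then evaluate across prevalence levels.
X_tr, y_tr = make_classification(n_samples=5000, n_features=10, random_state=1)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

for p in (0.05, 0.20, 0.50):
    auc, f1 = evaluate_at_prevalence(model, p)
    print(f"prevalence={p:.2f}  AUC={auc:.3f}  F1={f1:.3f}")

In runs of this kind, AUC typically varies little across the three prevalence levels while F1 moves substantially, which is consistent with the paper's finding that metrics incorporating all decision thresholds evaluate models more consistently as prevalence changes.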