The receiver operating characteristic curve accurately assesses imbalanced datasets.

Authors

Richardson Eve, Trevizani Raphael, Greenbaum Jason A, Carter Hannah, Nielsen Morten, Peters Bjoern

Affiliations

Center for Infectious Disease and Vaccine Research, La Jolla Institute for Immunology, La Jolla, CA, USA.

Fiocruz Ceará, Fundação Oswaldo Cruz, Rua São José s/n, Precabura, Eusébio/CE, Brazil.

Publication information

Patterns (N Y). 2024 May 31;5(6):100994. doi: 10.1016/j.patter.2024.100994. eCollection 2024 Jun 14.

Abstract

Many problems in biology require looking for a "needle in a haystack," corresponding to a binary classification where there are a few positives within a much larger set of negatives, which is referred to as a class imbalance. The receiver operating characteristic (ROC) curve and the associated area under the curve (AUC) have been reported as ill-suited to evaluate prediction performance on imbalanced problems where there is more interest in performance on the positive minority class, while the precision-recall (PR) curve is preferable. We show via simulation and a real case study that this is a misinterpretation of the difference between the ROC and PR spaces, showing that the ROC curve is robust to class imbalance, while the PR curve is highly sensitive to class imbalance. Furthermore, we show that class imbalance cannot be easily disentangled from classifier performance measured via PR-AUC.
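
The contrast the abstract draws can be illustrated with a small simulation. The sketch below is not the authors' code. It assumes a hypothetical classifier whose scores are Gaussian with unit variance, centered at 1.0 for positives and 0.0 for negatives, and it evaluates the same score distributions at negative-to-positive ratios of 1, 10, and 100. The sample sizes, the shift, the ratios, and all function names are illustrative assumptions. Average precision is used here as a step-wise estimate of PR-AUC.

```python
import random


def roc_auc(labels, scores):
    """ROC-AUC via the Mann-Whitney U statistic (assumes no tied scores)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    # Sum of the ascending ranks (1-based) held by the positive examples.
    rank_sum = sum(rank for rank, i in enumerate(order, start=1) if labels[i] == 1)
    # U counts the (positive, negative) pairs in which the positive scores higher.
    u = rank_sum - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)


def average_precision(labels, scores):
    """Mean of the precision values observed at each positive, best score first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_pos = sum(labels)
    hits = 0
    total = 0.0
    for k, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / k  # precision among the top-k predictions
    return total / n_pos


def simulate(n_pos, n_neg, rng, shift=1.0):
    """Draw scores for a fixed (assumed) classifier: only the class ratio varies."""
    labels = [1] * n_pos + [0] * n_neg
    scores = [rng.gauss(shift, 1.0) for _ in range(n_pos)]
    scores += [rng.gauss(0.0, 1.0) for _ in range(n_neg)]
    return labels, scores


rng = random.Random(0)  # fixed seed so the run is repeatable
n_pos = 1000
results = {}
for ratio in (1, 10, 100):
    labels, scores = simulate(n_pos, n_pos * ratio, rng)
    results[ratio] = (roc_auc(labels, scores), average_precision(labels, scores))
    print(f"neg:pos = {ratio:>3}:1  ROC-AUC = {results[ratio][0]:.3f}  "
          f"AP = {results[ratio][1]:.3f}")
```

Under these assumptions the ROC-AUC should stay roughly constant across the three ratios, while average precision should fall as negatives are added, even though the classifier's score distributions never change. This is consistent with the abstract's point that PR-AUC mixes class imbalance with classifier performance.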

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2d7e/11240176/3909b126b797/fx1.jpg
