Marra Andrew
Clinical Biostatistician at GE Healthcare, Chicago, IL, USA.
BioData Min. 2024 Oct 23;17(1):43. doi: 10.1186/s13040-024-00402-z.
In medical device validation and verification studies, the area under the receiver operating characteristic curve (AUROC) is often used as a primary endpoint despite multiple reports showing its limitations. Hence, researchers are encouraged to consider alternative metrics as primary endpoints. This manuscript presents a new metric, G4, defined as the geometric mean of sensitivity, specificity, the positive predictive value, and the negative predictive value. G4 belongs to a balanced metric family that also includes the Unified Performance Measure (also known as P4) and the Matthews correlation coefficient (MCC). The purpose of this manuscript is to demonstrate the benefits of using G4 together with the rest of the balanced metric family when analyzing the overall performance of binary classifiers.
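As a minimal sketch of how the three balanced metrics relate, the snippet below computes them from the four cells of a confusion matrix. G4 follows the definition given above. The P4 and MCC formulas are the standard ones (P4 as the harmonic mean of the same four quantities) and are an assumption insofar as the abstract does not spell them out. The function name `balanced_metrics` and the example counts are hypothetical.

```python
import math


def balanced_metrics(tp, fp, tn, fn):
    """Return G4, P4 and MCC for one 2x2 confusion matrix.

    Assumes every row and column total is non-zero, so that sensitivity,
    specificity, PPV and NPV are all defined.
    """
    sens = tp / (tp + fn)  # sensitivity (true positive rate)
    spec = tn / (tn + fp)  # specificity (true negative rate)
    ppv = tp / (tp + fp)   # positive predictive value
    npv = tn / (tn + fn)   # negative predictive value
    parts = (sens, spec, ppv, npv)

    # G4: geometric mean of the four quantities, as defined in the abstract.
    g4 = math.prod(parts) ** 0.25

    # P4: harmonic mean of the same four quantities (standard definition,
    # assumed here); it is 0 whenever any one of them is 0.
    p4 = 0.0 if min(parts) == 0 else 4 / sum(1 / p for p in parts)

    # MCC: standard definition; ranges over [-1, 1], unlike G4 and P4,
    # which range over [0, 1].
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {"g4": g4, "p4": p4, "mcc": mcc}


# Hypothetical, perfectly balanced example: 50 positives and 50 negatives.
example = balanced_metrics(tp=40, fp=10, tn=40, fn=10)
print(example)
```

Because the geometric mean of positive numbers is never smaller than their harmonic mean, G4 is always at least as large as P4 on the same confusion matrix.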
Simulated datasets covering a range of minority-class prevalence rates were analyzed under a multi-reader-multi-case study design. In addition, data from an independently published study that tested the performance of a unique ultrasound artificial intelligence (AI) algorithm for breast cancer detection were also considered. Within each dataset, AUROC was reported alongside the balanced metric family for comparison. When the dataset prevalence and bias of the minority class approached 50%, all three balanced metrics provided equivalent interpretations of the AI's performance. As the prevalence rate increased or decreased and the data became more imbalanced, AUROC tended to overvalue or undervalue the true classifier performance, while the balanced metric family was resistant to such imbalance. Where data imbalance was strong (minority-class prevalence < 10%), MCC was preferred for standalone assessments, while P4 provided a stronger effect size in between-groups analyses. G4 acted as a middle ground, performing well in both standalone assessments and between-groups analyses.
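The role of prevalence can be illustrated with a small sketch. It does not reproduce the study's multi-reader-multi-case simulation. It only assumes a hypothetical classifier whose sensitivity and specificity are fixed at 0.9 and shows how the predictive values, and therefore G4, change with minority-class prevalence. Sensitivity and specificity, the quantities from which the ROC curve is built, do not depend on prevalence, whereas PPV and NPV do. The function names and the chosen operating point are assumptions for illustration.

```python
def predictive_values(sens, spec, prevalence):
    """Expected PPV and NPV for a classifier with fixed sensitivity and
    specificity, applied to a population with the given prevalence."""
    tp = sens * prevalence              # expected true-positive fraction
    fn = (1 - sens) * prevalence        # expected false-negative fraction
    tn = spec * (1 - prevalence)        # expected true-negative fraction
    fp = (1 - spec) * (1 - prevalence)  # expected false-positive fraction
    return tp / (tp + fp), tn / (tn + fn)


def g4(sens, spec, prevalence):
    """Geometric mean of sensitivity, specificity, PPV and NPV."""
    ppv, npv = predictive_values(sens, spec, prevalence)
    return (sens * spec * ppv * npv) ** 0.25


# Hypothetical operating point; only the prevalence varies.
for prevalence in (0.5, 0.1, 0.01):
    ppv, npv = predictive_values(0.9, 0.9, prevalence)
    print(
        f"prevalence={prevalence:<5} PPV={ppv:.3f} NPV={npv:.3f} "
        f"G4={g4(0.9, 0.9, prevalence):.3f}"
    )
```

In this sketch the drop in PPV at low prevalence pulls G4 down even though sensitivity and specificity stay the same, which is the kind of information a summary built only on sensitivity and specificity cannot convey.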
Using AUROC as the primary endpoint in binary classification problems provides increasingly misleading results as the dataset becomes more imbalanced. This is especially noticeable when AUROC is used in medical device validation and verification studies. G4, P4, and MCC do not share this limitation and paint a more complete picture of a medical device's performance in a clinical setting. Therefore, researchers are encouraged to explore the balanced metric family when evaluating binary classification problems.