Whitney Heather M, Drukker Karen, Giger Maryellen L
University of Chicago, Department of Radiology, Chicago, Illinois, United States.
Wheaton College, Department of Physics, Wheaton, Illinois, United States.
J Med Imaging (Bellingham). 2022 May;9(3):035502. doi: 10.1117/1.JMI.9.3.035502. Epub 2022 May 31.
The aim of this study is to (1) demonstrate a graphical method and interpretation framework that extends performance evaluation beyond receiver operating characteristic (ROC) curve analysis and (2) assess the impact of disease prevalence and of variability in the training and test sets, particularly when a specific operating point is used. The proposed performance metric curves (PMCs) simultaneously show sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV), each with its 95% confidence interval (CI), as a function of the threshold applied to the decision variable. We investigated the utility of PMCs using six example operating points associated with commonly used selection methods (including the Youden index and maximum mutual information). As an example, we applied PMCs to the task of distinguishing between malignant and benign breast lesions using human-engineered radiomic features extracted from dynamic contrast-enhanced magnetic resonance images. The dataset comprised 1885 lesions, with the images acquired in 2015 and 2016 serving as the training set (1450 lesions) and those acquired in 2017 as the test set (435 lesions). Our study used this dataset in two ways: (1) as the clinical dataset itself and (2) as the basis for simulated datasets whose features were derived from the clinical set but which had five different disease prevalences. The median and 95% CI of the number of type I (false positive) and type II (false negative) errors were determined for each operating point of interest. PMCs from both the clinical and simulated datasets supported interpretation of how the choice of decision threshold affects type I and type II classification errors, particularly in relation to disease prevalence. PMCs allow simultaneous evaluation of four performance metrics (sensitivity, specificity, PPV, and NPV) as a function of the decision threshold. This may lead to a better understanding of two-class classifier performance in machine learning.
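The idea of tracing all four metrics against the decision threshold can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the convention that a score at or above the threshold is called positive, and the use of Bayes' rule to re-express PPV and NPV at a different prevalence are all assumptions made for illustration. The abstract reports 95% CIs; these are omitted here and would need, for example, a resampling step.

```python
import math


def confusion_counts(scores, labels, threshold):
    """Count TP, FP, TN, FN, calling a case positive when score >= threshold.

    Assumption: labels are 1 (diseased) or 0 (not diseased).
    """
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted_positive = score >= threshold
        if label == 1 and predicted_positive:
            tp += 1
        elif label == 1:
            fn += 1  # type II error (false negative)
        elif predicted_positive:
            fp += 1  # type I error (false positive)
        else:
            tn += 1
    return tp, fp, tn, fn


def _ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else math.nan


def performance_metric_curves(scores, labels, thresholds):
    """Hypothetical helper: one row of metrics per candidate threshold."""
    curves = []
    for t in thresholds:
        tp, fp, tn, fn = confusion_counts(scores, labels, t)
        curves.append({
            "threshold": t,
            "sensitivity": _ratio(tp, tp + fn),
            "specificity": _ratio(tn, tn + fp),
            "ppv": _ratio(tp, tp + fp),
            "npv": _ratio(tn, tn + fn),
            "type_I_errors": fp,
            "type_II_errors": fn,
        })
    return curves


def youden_threshold(curves):
    """Threshold maximizing the Youden index J = sensitivity + specificity - 1.

    Assumes both classes are present, so sensitivity and specificity are defined.
    """
    best = max(curves, key=lambda row: row["sensitivity"] + row["specificity"] - 1)
    return best["threshold"]


def predictive_values_at_prevalence(sensitivity, specificity, prevalence):
    """Re-express PPV and NPV at a chosen prevalence using Bayes' rule."""
    ppv = _ratio(sensitivity * prevalence,
                 sensitivity * prevalence + (1 - specificity) * (1 - prevalence))
    npv = _ratio(specificity * (1 - prevalence),
                 specificity * (1 - prevalence) + (1 - sensitivity) * prevalence)
    return ppv, npv


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    toy_scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
    toy_labels = [0, 0, 0, 1, 0, 1, 1, 1]
    for row in performance_metric_curves(toy_scores, toy_labels, [0.35, 0.5, 0.85]):
        print(row)
```

Plotting the four metric columns against `threshold` on shared axes gives a curve set in the spirit of a PMC. In this sketch sensitivity and specificity do not depend on prevalence, while PPV and NPV shift with it, which is why an operating point chosen at one prevalence can produce a different balance of type I and type II errors at another.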