Office of Science and Engineering Laboratories, Center for Devices and Radiological Health, Food and Drug Administration, Silver Spring, USA.
Stat Methods Med Res. 2018 May;27(5):1394-1409. doi: 10.1177/0962280216661371. Epub 2016 Aug 8.
Scores produced by statistical classifiers in many clinical decision support systems and other medical diagnostic devices are generally on an arbitrary scale, so their clinical meaning is unclear. Calibrating classifier scores to a meaningful scale, such as the probability of disease, is potentially useful when a physician uses such scores. In this work, we investigated three methods (parametric, semi-parametric, and non-parametric) for calibrating classifier scores to the probability-of-disease scale and developed uncertainty estimation techniques for these methods. We showed that classifier scores on arbitrary scales can be calibrated to the probability-of-disease scale without affecting their discrimination performance. Because the calibration function is trained on a finite dataset, it is important to accompany each probability estimate with its confidence interval. Our simulations indicate that, when the dataset used to find the calibration transformation is also used to estimate calibration performance, resubstitution bias arises for performance metrics that involve the truth states. However, this bias is small for the parametric and semi-parametric methods when the sample size is moderate to large (>100 per class).
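The idea of calibrating scores without changing discrimination, and of attaching a confidence interval to each calibrated probability, can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the paper: logistic (Platt-style) calibration stands in for a parametric method, the data are simulated from two normal distributions, the interval is a percentile bootstrap, and all function and variable names are hypothetical. It is not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_logistic(scores, labels, n_iter=50):
    """Fit p = sigmoid(a + b * score) by Newton-Raphson (assumed parametric form)."""
    X = np.column_stack([np.ones_like(scores), scores])
    w = np.zeros(2)
    for _ in range(n_iter):
        z = np.clip(X @ w, -30.0, 30.0)           # guard against overflow in exp
        p = 1.0 / (1.0 + np.exp(-z))
        grad = X.T @ (labels - p)                 # gradient of the log-likelihood
        hess = (X * (p * (1.0 - p))[:, None]).T @ X
        step = np.linalg.solve(hess, grad)
        w = w + step
        if np.max(np.abs(step)) < 1e-10:
            break
    return w

def calibrate(w, scores):
    """Map raw scores to estimated probability of disease."""
    s = np.asarray(scores, dtype=float)
    return 1.0 / (1.0 + np.exp(-(w[0] + w[1] * s)))

def auc(values, labels):
    """Empirical AUC (Mann-Whitney statistic), ties counted as 1/2."""
    pos = values[labels == 1]
    neg = values[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

def bootstrap_ci(scores, labels, query, n_boot=200, alpha=0.05):
    """Percentile bootstrap interval for the calibrated probability at `query`.
    Resampling is done within each class, so the sample prevalence stays fixed."""
    pos_idx = np.flatnonzero(labels == 1)
    neg_idx = np.flatnonzero(labels == 0)
    est = np.empty(n_boot)
    for b in range(n_boot):
        idx = np.concatenate([rng.choice(pos_idx, pos_idx.size),
                              rng.choice(neg_idx, neg_idx.size)])
        est[b] = calibrate(fit_logistic(scores[idx], labels[idx]), query)
    return np.quantile(est, [alpha / 2, 1 - alpha / 2])

# Simulated classifier scores on an arbitrary scale (assumption: 200 per class).
n = 200
scores = np.concatenate([3 * rng.normal(0.0, 1.0, n) + 10,
                         3 * rng.normal(1.5, 1.0, n) + 10])
labels = np.concatenate([np.zeros(n), np.ones(n)])

w = fit_logistic(scores, labels)
p_cal = calibrate(w, scores)

query = 11.0                                      # hypothetical raw score
p_query = float(calibrate(w, query))
lo, hi = bootstrap_ci(scores, labels, query)

print("AUC on raw scores:       ", auc(scores, labels))
print("AUC on calibrated scores:", auc(p_cal, labels))
print(f"P(disease | score={query}) = {p_query:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

Because the fitted map is strictly increasing when the slope is positive, the rank order of the scores, and hence the empirical AUC, is unchanged. Evaluating calibration on the same data used to fit `w` would be the resubstitution setting that the abstract warns about; a held-out set or cross-validation would avoid it.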