Dipartimento di Informatica, Sistemistica e Comunicazione, Università di Milano-Bicocca, Milano, Italy.
Polito(BIO)Med Lab, Politecnico di Torino, Torino, Italy; USE-ME-D srl, I3P Politecnico di Torino, Torino, Italy.
Comput Methods Programs Biomed. 2022 Jun;221:106930. doi: 10.1016/j.cmpb.2022.106930. Epub 2022 Jun 3.
Background and Objective: Evaluation of AI-based decision support systems (AI-DSS) is critically important in practical applications; nonetheless, common evaluation metrics fail to properly account for relevant and contextual information. In this article we discuss a novel utility metric, the weighted Utility (wU), for the evaluation of AI-DSS, which is based on the raters' perceptions of their annotation hesitation and of the relevance of the training cases.
Methods: We discuss the relationship between the proposed metric and previous proposals, and we describe the application of the proposed metric to both model evaluation and optimization through three realistic case studies.
Results: We show that our metric generalizes the well-known Net Benefit, as well as other common error-based and utility-based metrics. Through the empirical studies, we show that our metric can provide a more flexible tool for the evaluation of AI models. We also show that, compared to other optimization metrics, model optimization based on the wU can provide significantly better performance (AUC 0.862 vs 0.895, p-value <0.05), especially on cases judged to be more complex by the human annotators (AUC 0.85 vs 0.92, p-value <0.05).
Conclusions: We argue that utility should be a primary concern in the evaluation and optimization of machine learning models in critical domains, such as medicine, and that a human-centred approach is important for assessing the potential impact of AI models on human decision making, drawing also on further information that can be collected during the ground-truthing process.
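The abstract does not give the formula for wU, so the following is only a minimal sketch of the standard Net Benefit that wU is said to generalize, computed as TP/N - FP/N * p_t/(1 - p_t) for a probability threshold p_t. The optional `weights` argument is a hypothetical per-case weighting added here for illustration (for example, a rater-supplied relevance score); it is an assumption and not the paper's definition of wU. With uniform weights the function reduces to the ordinary Net Benefit.

```python
from typing import Optional, Sequence


def net_benefit(
    y_true: Sequence[int],
    y_pred: Sequence[int],
    threshold: float,
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Net Benefit of binary decisions y_pred at probability threshold p_t.

    y_true and y_pred hold 0/1 labels and 0/1 decisions (1 = positive/treat).
    `weights` is a hypothetical per-case weight (an assumption for this
    sketch, not the wU formula); None means every case counts equally.
    """
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if weights is None:
        weights = [1.0] * len(y_true)
    if len(weights) != len(y_true):
        raise ValueError("weights must have one entry per case")

    total = float(sum(weights))
    if total <= 0.0:
        raise ValueError("weights must sum to a positive value")

    # Weighted counts of true positives and false positives.
    tp = sum(w for t, p, w in zip(y_true, y_pred, weights) if p == 1 and t == 1)
    fp = sum(w for t, p, w in zip(y_true, y_pred, weights) if p == 1 and t == 0)

    # False positives are penalized by the odds of the threshold.
    odds = threshold / (1.0 - threshold)
    return tp / total - odds * fp / total
```

For example, `net_benefit([1, 1, 0, 0], [1, 0, 1, 0], threshold=0.2)` counts one true positive and one false positive out of four cases; lowering the threshold shrinks the penalty on the false positive, which reflects a setting where missing a positive case is considered more costly than an unnecessary intervention.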