Petersen Eike, Holm Sune, Ganz Melanie, Feragen Aasa
DTU Compute, Technical University of Denmark, Richard Pedersens Plads, 2800 Kgs. Lyngby, Denmark.
Pioneer Centre for AI, Øster Voldgade 3, 1350 Copenhagen, Denmark.
Patterns (N Y). 2023 Jul 14;4(7):100790. doi: 10.1016/j.patter.2023.100790.
To ensure equitable quality of care, differences in machine learning model performance between patient groups must be addressed. Here, we argue that two separate mechanisms can cause performance differences between groups. First, model performance may be worse than theoretically achievable in a given group. This can occur due to a combination of group underrepresentation, modeling choices, and the characteristics of the prediction task at hand. We examine scenarios in which underrepresentation leads to underperformance, scenarios in which it does not, and the differences between them. Second, the optimal achievable performance may also differ between groups due to differences in the intrinsic difficulty of the prediction task. We discuss several possible causes of such differences in task difficulty. In addition, challenges such as label biases and selection biases may confound both learning and performance evaluation. We highlight consequences for the path toward equal performance, and we emphasize that leveling model performance may require gathering not only data from underperforming groups but also other data. Throughout, we ground our discussion in real-world medical phenomena and case studies while also referencing relevant statistical theory.