Department of Electrical, Computer and Biomedical Engineering, University of Pavia, Italy.
Department of Medical Informatics, Amsterdam UMC, University of Amsterdam, the Netherlands.
J Biomed Inform. 2022 Mar;127:103996. doi: 10.1016/j.jbi.2022.103996. Epub 2022 Jan 15.
Interest in applying machine learning to clinical and biological problems is increasing. This is driven by promising results reported in many research papers, by the growing number of AI-based software products, and by general interest in Artificial Intelligence as a means to solve complex problems. It is therefore important to improve the quality of machine learning outputs and to add safeguards that support their adoption. Besides regulatory and logistical strategies, a crucial aspect is detecting when a machine learning model cannot generalize to new, unseen instances, which may come from a population distant from the training population or from an under-represented subpopulation. For such instances the model's predictions are often wrong, since the model is applied outside its "reliable" working space, eroding the trust of end users such as clinicians. For this reason, when a model is deployed in practice, it is important to warn users when its predictions may be unreliable, especially in high-stakes applications, including those in healthcare. Yet the reliability assessment of individual machine learning predictions remains poorly addressed. Here, we review approaches that can support the identification of unreliable predictions, harmonize the notation and terminology of the relevant concepts, and highlight and extend the possible interrelationships and overlaps among them. We then demonstrate, on simulated and real data for ICU in-hospital death prediction, a possible integrative framework for the identification of reliable and unreliable predictions. Our proposed approach implements two complementary principles: the density principle and the local fit principle. The density principle verifies that the instance to be evaluated is similar to the training set; the local fit principle verifies that the trained model performs well on the training subsets most similar to that instance. Our work can contribute to consolidating machine learning research, especially in medicine.
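To make the two principles concrete, the sketch below is a minimal, illustrative Python implementation and is not the authors' code: it assumes a k-nearest-neighbor distance as a stand-in for the density principle and accuracy on an instance's nearest training neighbors as a stand-in for the local fit principle; the function names (`density_score`, `local_fit_score`, `is_reliable`) and thresholds are hypothetical.

```python
# Illustrative sketch of the density and local fit principles described in the
# abstract; the concrete estimators and thresholds here are assumptions, not
# the authors' implementation.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def density_score(x, X_train, k=10):
    """Density principle: is x similar to the training set?
    Returns the mean distance from x to its k nearest training points
    (lower = more similar to the training data)."""
    nn = NearestNeighbors(n_neighbors=k).fit(X_train)
    dist, _ = nn.kneighbors(x.reshape(1, -1))
    return dist.mean()

def local_fit_score(x, X_train, y_train, model, k=50):
    """Local fit principle: does the trained model perform well on the
    training subset most similar to x? Returns accuracy on x's k nearest
    training neighbors (higher = better local fit)."""
    nn = NearestNeighbors(n_neighbors=k).fit(X_train)
    _, idx = nn.kneighbors(x.reshape(1, -1))
    neighbors_X, neighbors_y = X_train[idx[0]], y_train[idx[0]]
    return (model.predict(neighbors_X) == neighbors_y).mean()

def is_reliable(x, X_train, y_train, model, dens_thresh, fit_thresh):
    """Flag a prediction as reliable only if both principles pass,
    i.e., x lies close to the training data AND the model fits well
    in x's neighborhood."""
    return (density_score(x, X_train) <= dens_thresh and
            local_fit_score(x, X_train, y_train, model) >= fit_thresh)
```

In this reading, the two checks are complementary: the density check catches instances far from any training data, while the local fit check catches instances that lie in a populated but poorly modeled region.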