Nagendran Myura, Festor Paul, Komorowski Matthieu, Gordon Anthony C, Faisal Aldo A
UKRI Centre for Doctoral Training in AI for Healthcare, Imperial College London, London, UK.
Division of Anaesthetics, Pain Medicine, and Intensive Care, Imperial College London, London, UK.
NPJ Digit Med. 2024 Aug 2;7(1):202. doi: 10.1038/s41746-024-01200-x.
We studied clinical AI-supported decision-making as an example of a high-stakes setting in which explainable AI (XAI) has been proposed as useful (in theory, by providing physicians with context for an AI suggestion and thereby helping them to reject unsafe AI recommendations). Here, we used an objective neurobehavioural measure (eye-tracking) to examine how N = 19 ICU physicians responded to XAI in a hospital's clinical simulation suite. Physicians made prescription decisions both before and after the reveal of either a safe or an unsafe AI recommendation, presented alongside four different types of XAI shown simultaneously. We used overt visual attention as a marker of where physicians' mental attention was directed during the simulations. Unsafe AI recommendations attracted significantly more attention than safe AI recommendations. However, attention to any of the four types of explanation was not appreciably higher during unsafe AI scenarios (i.e., XAI did not appear to 'rescue' decision-makers). Furthermore, physicians' self-reported usefulness of the explanations did not correlate with the attention they devoted to those explanations, reinforcing the notion that evaluating XAI tools with self-reports alone misses key aspects of the interaction behaviour between human and machine.