Validity evidence supporting clinical skills assessment by artificial intelligence compared with trained clinician raters.
Author information
Center for Fetal Medicine, Department of Obstetrics, Copenhagen University Hospital, Rigshospitalet, Copenhagen, Denmark.
Faculty of Health and Medical Science, University of Copenhagen, Copenhagen, Denmark.
Publication information
Med Educ. 2024 Jan;58(1):105-117. doi: 10.1111/medu.15190. Epub 2023 Aug 24.
BACKGROUND
Artificial intelligence (AI) is increasingly used in medical education, but our understanding of the validity of AI-based assessments (AIBA) compared with traditional clinical expert-based assessments (EBA) is limited. In this study, the authors aimed to compare and contrast the validity evidence for the assessment of a complex clinical skill based on scores generated by an AI and by trained clinical experts, respectively.
METHODS
The study was conducted between September 2020 and October 2022. The authors used Kane's validity framework to prioritise and organise their evidence according to the four inferences: scoring, generalisation, extrapolation and implications. The context of the study was chorionic villus sampling performed in a simulated setting. AIBA and EBA were used to evaluate the performances of experts, intermediates and novices based on video recordings. The clinical experts used a scoring instrument developed in a previous international consensus study. The AI used convolutional neural networks to capture features from video recordings, motion tracking and eye movements, combining them into a final composite score.
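The paper does not publish its model code, so the following is only a minimal sketch of the kind of multimodal fusion the methods describe: CNN features from video frames concatenated with motion-tracking and eye-movement feature vectors to produce one composite score. All module names, layer sizes, feature dimensions and the concatenation-based fusion are assumptions for illustration, not the authors' actual architecture.

```python
# Illustrative sketch only; the authors' real model is not published.
# Layer sizes, input shapes and the late-fusion design are assumptions.
import torch
import torch.nn as nn

class CompositeScorer(nn.Module):
    def __init__(self, motion_dim=16, gaze_dim=8):
        super().__init__()
        # Small CNN over video frames (assumed 3x64x64 RGB input).
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # -> (batch, 32, 1, 1)
        )
        # Fuse CNN features with motion-tracking and eye-movement features.
        self.head = nn.Sequential(
            nn.Linear(32 + motion_dim + gaze_dim, 32), nn.ReLU(),
            nn.Linear(32, 1),  # single composite performance score
        )

    def forward(self, frames, motion, gaze):
        video_feat = self.cnn(frames).flatten(1)          # (batch, 32)
        fused = torch.cat([video_feat, motion, gaze], 1)  # concatenate modalities
        return self.head(fused).squeeze(1)                # (batch,)

# Dummy batch: 4 video frames plus motion and gaze feature vectors.
model = CompositeScorer()
scores = model(torch.randn(4, 3, 64, 64), torch.randn(4, 16), torch.randn(4, 8))
print(scores.shape)  # torch.Size([4])
```

Late fusion by concatenation is only one common design choice; the authors may equally have scored each modality separately and combined the sub-scores, which the abstract's wording ("arrive at a final composite score") does not settle.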
RESULTS
A total of 45 individuals participated in the study (22 novices, 12 intermediates and 11 experts). The authors demonstrated validity evidence for scoring, generalisation, extrapolation and implications for both EBA and AIBA. The plausibility of assumptions related to scoring, evidence of reproducibility, and the relation of scores to different training levels were examined. Issues relating to construct underrepresentation, lack of explainability and threats to robustness were identified as potential weak links in the AIBA validity argument compared with the EBA validity argument.
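To make the "relation to different training levels" evidence concrete, a conventional group-difference check on composite scores might look like the sketch below. The score values are synthetic placeholders (only the group sizes match the study), and the choice of a Kruskal-Wallis test is an assumption for illustration, not the paper's reported analysis.

```python
# Illustrative sketch only: synthetic scores, not study data.
import numpy as np
from scipy.stats import kruskal

rng = np.random.default_rng(0)
# Hypothetical composite scores; group sizes match the study
# (22 novices, 12 intermediates, 11 experts).
novices = rng.normal(0.40, 0.10, 22)
intermediates = rng.normal(0.60, 0.10, 12)
experts = rng.normal(0.80, 0.10, 11)

# Kruskal-Wallis: do score distributions differ across training levels?
stat, p = kruskal(novices, intermediates, experts)
print(f"H = {stat:.2f}, p = {p:.4f}")
for name, group in [("novice", novices), ("intermediate", intermediates),
                    ("expert", experts)]:
    print(f"{name:>12}: mean score = {group.mean():.2f}")
```

Evidence that scores rise with training level supports the extrapolation inference in Kane's framework: the measure behaves as the underlying construct of procedural competence predicts.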
CONCLUSION
There were weak links in the use of AIBA compared with EBA, mainly in its representation of the underlying construct but also in its explainability and its ability to transfer to other datasets. However, combining AI-based and clinical expert-based assessments may offer complementary benefits, which is a promising subject for future research.