Shengjing Hospital of China Medical University, Shenyang, Liaoning 110004, China; School of Health Management, China Medical University, Shenyang, Liaoning 110122, China.
Department of Thoracic Surgery, Shengjing Hospital of China Medical University, Shenyang, Liaoning 110004, China.
Resuscitation. 2024 Nov;204:110404. doi: 10.1016/j.resuscitation.2024.110404. Epub 2024 Sep 28.
To assess the accuracy and reliability of GPT-4o for scoring examinees' performance on cardiopulmonary resuscitation (CPR) skills tests.
This study included six experts certified to supervise the national medical licensing examination (three junior and three senior), who reviewed the CPR skills test videos of 103 examinees. All videos reviewed by the experts were also subjected to automated assessment by GPT-4o. Both the experts and GPT-4o scored the videos across four sections: patient assessment, chest compressions, rescue breathing, and repeated operations. The experts subsequently rated GPT-4o's reliability on a 5-point Likert scale (1, completely unreliable; 5, completely reliable). GPT-4o's accuracy was evaluated using the intraclass correlation coefficient (for the first three sections) and Fleiss' Kappa (for the last section) to assess the agreement between its scores and those of the experts.
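The abstract does not state which software computed the agreement statistics. As an illustration only, Fleiss' Kappa, the statistic used here for the repeated-operations section, can be sketched in plain Python for a ratings table where each row holds the per-category rater counts for one examinee:

```python
# Illustrative sketch (not the authors' code): Fleiss' kappa for
# multi-rater categorical agreement. counts[i][j] is the number of
# raters who assigned subject i to category j; every subject is
# assumed to be rated by the same number of raters.

def fleiss_kappa(counts):
    n_subjects = len(counts)
    n_raters = sum(counts[0])  # assumed constant across subjects
    n_categories = len(counts[0])

    # Observed per-subject agreement P_i
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_subjects

    # Chance agreement P_e from the marginal category proportions
    totals = [sum(row[j] for row in counts) for j in range(n_categories)]
    p_j = [t / (n_subjects * n_raters) for t in totals]
    p_e = sum(p * p for p in p_j)

    return (p_bar - p_e) / (1 - p_e)

# Three raters, two categories (e.g. pass/fail), perfect agreement:
print(fleiss_kappa([[3, 0], [0, 3]]))  # -> 1.0
```

The examinee-by-category counts and the two-category layout above are hypothetical; the study's actual scoring rubric and category set are not given in the abstract.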
The mean accuracy scores for the patient assessment, chest compressions, rescue breathing, and repeated operations sections were 0.65, 0.58, 0.60, and 0.31, respectively, when comparing GPT-4o's scores with those of the junior experts, and 0.75, 0.65, 0.72, and 0.41, respectively, when comparing GPT-4o's scores with those of the senior experts. For reliability, the median Likert scale scores were 4.00 (interquartile range [IQR] = 3.66-4.33; mean [standard deviation] = 3.95 [0.55]) and 4.33 (IQR = 4.00-4.67; 4.29 [0.50]) for the junior and senior experts, respectively.
GPT-4o demonstrated accuracy similar to that of senior experts when scoring CPR skills examination videos. These results demonstrate the potential of deploying this large language model in medical examination settings.