Shengjing Hospital of China Medical University, Shenyang, Liaoning 110004, China; School of Health Management, China Medical University, Shenyang, Liaoning 110122, China.
Department of Thoracic Surgery, Shengjing Hospital of China Medical University, Shenyang, Liaoning 110004, China.
Resuscitation. 2024 Nov;204:110404. doi: 10.1016/j.resuscitation.2024.110404. Epub 2024 Sep 28.
To assess the accuracy and reliability of GPT-4o for scoring examinees' performance on cardiopulmonary resuscitation (CPR) skills tests.
This study included six experts certified to supervise the national medical licensing examination (three junior and three senior), who reviewed the CPR skills test videos of 103 examinees. All videos reviewed by the experts were also subjected to automated assessment by GPT-4o. Both the experts and GPT-4o scored the videos across four sections: patient assessment, chest compressions, rescue breathing, and repeated operations. The experts subsequently rated GPT-4o's reliability on a 5-point Likert scale (1, completely unreliable; 5, completely reliable). GPT-4o's accuracy was evaluated using the intraclass correlation coefficient (for the first three sections) and Fleiss' Kappa (for the last section) to assess the agreement between its scores and those of the experts.
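The abstract does not state which software computed the agreement statistics. As an illustration only, Fleiss' Kappa, the statistic used here for the repeated-operations section, can be sketched in plain Python for a ratings table where each row holds the per-category rater counts for one examinee:

```python
# Illustrative sketch (not the authors' code): Fleiss' kappa for
# multi-rater categorical agreement. counts[i][j] is the number of
# raters who assigned subject i to category j; every subject is
# assumed to be rated by the same number of raters.

def fleiss_kappa(counts):
    n_subjects = len(counts)
    n_raters = sum(counts[0])  # assumed constant across subjects
    n_categories = len(counts[0])

    # Observed per-subject agreement P_i
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_subjects

    # Chance agreement P_e from the marginal category proportions
    totals = [sum(row[j] for row in counts) for j in range(n_categories)]
    p_j = [t / (n_subjects * n_raters) for t in totals]
    p_e = sum(p * p for p in p_j)

    return (p_bar - p_e) / (1 - p_e)

# Three raters, two categories (e.g. pass/fail), perfect agreement:
print(fleiss_kappa([[3, 0], [0, 3]]))  # -> 1.0
```

The examinee-by-category counts and the two-category layout above are hypothetical; the study's actual scoring rubric and category set are not given in the abstract.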
The mean accuracy scores for the patient assessment, chest compressions, rescue breathing, and repeated operations sections were 0.65, 0.58, 0.60, and 0.31, respectively, when comparing GPT-4o's scores with those of the junior experts, and 0.75, 0.65, 0.72, and 0.41, respectively, when comparing GPT-4o's scores with those of the senior experts. For reliability, the median Likert scale scores were 4.00 (interquartile range [IQR] = 3.66-4.33; mean [standard deviation] = 3.95 [0.55]) and 4.33 (IQR = 4.00-4.67; 4.29 [0.50]) for the junior and senior experts, respectively.
GPT-4o demonstrated accuracy similar to that of senior experts when scoring CPR skills examination videos. These results demonstrate the potential of deploying this large language model in medical examination settings.