Park ChulHyoung, An Min Ho, Hwang Gyubeom, Park Rae Woong, An Juho
Department of Biomedical Informatics, Ajou University School of Medicine, Suwon, Republic of Korea.
Center for Biomedical Informatics Research, Ajou University Medical Center, Suwon, Republic of Korea.
JMIR Med Inform. 2025 Jul 17;13:e68409. doi: 10.2196/68409.
Emergency medicine can benefit from artificial intelligence (AI) due to its unique challenges, such as high patient volume and the need for urgent interventions. However, it remains difficult to assess the applicability of AI systems to real-world emergency medicine practice, which requires not only medical knowledge but also adaptable problem-solving and effective communication skills.
We aimed to compare the performance of ChatGPT (OpenAI) with that of human doctors in simulated emergency medicine settings, using a framework of clinical performance examinations and written examinations.
In total, 12 human doctors were recruited to represent medical professionals. Both ChatGPT and the human doctors were instructed to manage each of 12 simulated patient cases as they would in a real clinical setting. After the clinical performance examination sessions, an emergency medicine professor evaluated the conversation records for history taking, clinical accuracy, and empathy on a 5-point Likert scale. The simulated patients completed a 5-point scale survey for each case covering overall comprehensibility, credibility, and concern reduction; they also rated how similar the doctor they interacted with was to a human doctor. An additional evaluation was performed using vignette-based written examinations to assess diagnosis, investigation, and treatment planning. The mean scores of ChatGPT were then compared with those of the human doctors.
ChatGPT scored significantly higher than the physicians in both history taking (mean score 3.91, SD 0.67 vs mean score 2.67, SD 0.78; P<.001) and empathy (mean score 4.50, SD 0.67 vs mean score 1.75, SD 0.62; P<.001), whereas there was no significant difference in clinical accuracy. In the survey of simulated patients, ChatGPT scored higher for concern reduction (mean score 4.33, SD 0.78 vs mean score 3.58, SD 0.90; P=.04). ChatGPT also performed better on comprehensibility and credibility, but the differences were not significant. No significant difference was observed in the similarity assessment score (mean score 3.50, SD 1.78 vs mean score 3.25, SD 1.86; P=.71).
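As a hedged illustration only, the reported history-taking comparison can be roughly reproduced from the summary statistics above with an independent-samples t test; the abstract does not state which statistical test was used, and the assumption of 12 scores per group (one per simulated-patient case) is ours, so this is a minimal sketch rather than the study's actual analysis.

# Minimal sketch: check the history-taking comparison from reported means and SDs,
# assuming n = 12 cases per group and a Welch t test (the original study may have
# used a different, e.g., nonparametric, test for Likert-scale data).
from scipy.stats import ttest_ind_from_stats

n_cases = 12  # assumption: one score per simulated-patient case

result = ttest_ind_from_stats(
    mean1=3.91, std1=0.67, nobs1=n_cases,  # ChatGPT, history taking
    mean2=2.67, std2=0.78, nobs2=n_cases,  # human doctors, history taking
    equal_var=False,                       # Welch's t test
)
print(f"t = {result.statistic:.2f}, P = {result.pvalue:.4f}")
# Yields P < .001, consistent with the value reported in the abstract.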
ChatGPT's performance highlights its potential as a valuable adjunct in emergency medicine, demonstrating comparable proficiency in knowledge application, efficiency, and empathetic patient interaction. These results suggest that a collaborative health care model, integrating AI with human expertise, could enhance patient care and outcomes.