Adam Cheng, Vikhashni Nagesh, Susan Eller, Vincent Grant, Yiqun Lin
From the KidSIM Simulation Program (A.C.), Alberta Children's Hospital, Departments of Pediatrics and Emergency Medicine, Cumming School of Medicine, University of Calgary, Calgary, Alberta, Canada; Department of Pediatrics (V.N.), Cumming School of Medicine, University of Calgary, Calgary, Alberta, Canada; Center for Immersive and Simulation-Based Learning (S.E.), Stanford School of Medicine, Stanford, CA; Departments of Pediatrics and Emergency Medicine (V.G.), Cumming School of Medicine, University of Calgary, Calgary, Alberta, Canada; and KidSIM Simulation Program (Y.L.), Alberta Children's Hospital, Calgary, Alberta, Canada.
Simul Healthc. 2025 Aug 7. doi: 10.1097/SIH.0000000000000877.
Large language model-based generative AI tools, such as the Chat Generative Pre-trained Transformer (ChatGPT) platform, have been used to assist with writing academic manuscripts. Little is known about ChatGPT's ability to accurately cite relevant references in health care simulation-related scholarly manuscripts. In this study, we sought to: (1) determine the reference accuracy and citation relevance among health care simulation debriefing articles generated by 2 different models of ChatGPT and (2) determine if ChatGPT models can be trained with specific prompts to improve reference accuracy and citation relevance.
The ChatGPT-4 and ChatGPT o1 models were asked to generate scholarly articles with appropriate references based on 3 different article titles about health care simulation debriefing. Five articles with references were generated for each article title: 3 under ChatGPT-4 training conditions and 2 under ChatGPT o1 training conditions. Each article was assessed independently by 2 blinded reviewers for reference accuracy and citation relevance.
Fifteen articles were generated in total: 9 articles by ChatGPT-4 and 6 articles by ChatGPT o1. A total of 60.4% of the 303 references generated across 5 training conditions were classified as accurate, with no significant difference in reference accuracy between the 5 conditions. A total of 22.2% of the 451 citations were classified as highly relevant, with no significant difference in citation relevance across the 5 conditions.
Both ChatGPT-4 and ChatGPT o1 are unreliable with respect to reference accuracy and citation relevance when generating debriefing articles. Reference accuracy and citation relevance do not improve even when some degree of training is built into ChatGPT prompts.