Ghanem Diane, Zhu Alexander R, Kagabo Whitney, Osgood Greg, Shafiq Babar
Department of Orthopaedic Surgery, The Johns Hopkins Hospital, Baltimore, Maryland.
School of Medicine, The Johns Hopkins University, Baltimore, Maryland.
JB JS Open Access. 2024 Sep 5;9(3). doi: 10.2106/JBJS.OA.24.00099. eCollection 2024 Jul-Sep.
The artificial intelligence language model Chat Generative Pretrained Transformer (ChatGPT) has shown potential as a reliable and accessible educational resource in orthopaedic surgery. Yet the accuracy of the references behind the information it provides remains unverified, which poses a concern for maintaining the integrity of medical content. This study examines the accuracy of the references provided by ChatGPT-4 concerning the Airway, Breathing, Circulation, Disability, Exposure (ABCDE) approach in trauma surgery.
Two independent reviewers critically assessed 30 ChatGPT-4-generated references supporting the well-established ABCDE approach to trauma protocol, grading each as 0 (nonexistent), 1 (inaccurate), or 2 (accurate). All discrepancies between the ChatGPT-4 and PubMed references were carefully reviewed and bolded. Cohen's kappa coefficient was used to examine interreviewer agreement on the accuracy scores of the ChatGPT-4-generated references. Descriptive statistics were used to summarize the mean reference accuracy scores, and one-way analysis of variance was used to compare mean scores across the 5 categories.
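The interreviewer agreement statistic described above can be sketched in a few lines of Python. This is a minimal illustration only: the two reviewers' grade lists below are hypothetical, not the study's actual ratings.

```python
from collections import Counter

def cohens_kappa(rater1, rater2):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    n = len(rater1)
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    c1, c2 = Counter(rater1), Counter(rater2)
    # Expected agreement under independence of the two raters
    expected = sum(c1[k] * c2[k] for k in set(c1) | set(c2)) / n**2
    return (observed - expected) / (1 - expected)

# Hypothetical grades (0 = nonexistent, 1 = inaccurate, 2 = accurate)
reviewer_a = [2, 2, 1, 0, 2, 1, 1, 2, 0, 2]
reviewer_b = [2, 2, 1, 0, 2, 1, 2, 2, 0, 2]
print(round(cohens_kappa(reviewer_a, reviewer_b), 3))  # → 0.833
```

By convention, kappa above roughly 0.8 is read as almost perfect agreement, which is the kind of consistency a two-reviewer grading protocol aims for.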
ChatGPT-4 had an average reference accuracy score of 66.7%. Of the 30 references, only 43.3% were accurate and deemed "true," while 56.7% were categorized as "false" (43.3% inaccurate and 13.3% nonexistent). Accuracy was consistent across the 5 trauma protocol categories, with no statistically significant difference (p = 0.437).
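The comparison across the 5 ABCDE categories rests on a one-way ANOVA F statistic, which can be computed from the per-category grades directly. The sketch below uses hypothetical grade data (6 references per category), not the study's actual ratings, and omits the p-value lookup against the F distribution.

```python
def one_way_anova_f(groups):
    """F statistic for a one-way ANOVA across k independent groups."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Between-group and within-group sums of squares
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum(sum((x - m) ** 2 for x in g) for g, m in zip(groups, means))
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical 0/1/2 grades for 6 references in each of the 5 categories
grades = [
    [2, 2, 1, 0, 2, 1],  # Airway
    [2, 1, 1, 2, 1, 2],  # Breathing
    [1, 2, 2, 1, 2, 0],  # Circulation
    [2, 0, 1, 2, 1, 2],  # Disability
    [1, 2, 2, 0, 2, 1],  # Exposure
]
print(round(one_way_anova_f(grades), 3))
```

A small F statistic (and correspondingly large p-value, such as the reported p = 0.437) indicates that mean accuracy did not differ meaningfully between categories.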
With 57% of references being inaccurate or nonexistent, ChatGPT-4 falls short of providing reliable and reproducible references, a concerning finding for the safety of using ChatGPT-4 in professional medical decision-making without thorough verification. Only if used cautiously, with cross-referencing, can this language model serve as an adjunct learning tool that enhances comprehensiveness as well as knowledge rehearsal and manipulation.