Warwick Medical School, University of Warwick, Coventry, United Kingdom.
University Hospitals Coventry and Warwickshire, Coventry, United Kingdom.
PLoS One. 2024 Feb 14;19(2):e0297701. doi: 10.1371/journal.pone.0297701. eCollection 2024.
INTRODUCTION: ChatGPT, a sophisticated large language model (LLM), has garnered widespread attention for its ability to mimic human-like communication. As recent studies indicate a potential supportive role for ChatGPT in academic writing, we assessed the LLM's capacity to generate accurate and comprehensive scientific abstracts from published Randomised Controlled Trial (RCT) data, focusing on adherence to the Consolidated Standards of Reporting Trials for Abstracts (CONSORT-A) statement, in comparison with the original authors' abstracts.
METHODOLOGY: RCTs published after September 2021, identified through a PubMed/MEDLINE search across various medical disciplines, had abstracts generated by ChatGPT versions 3.5 and 4 in accordance with the guidelines of the respective journals. The overall quality score (OQS) of each abstract was the total number of adequately reported components from the 18-item CONSORT-A checklist. Additional outcome measures included percent adherence to each CONSORT-A item, readability, hallucination rate, and regression analysis of determinants of reporting quality.
RESULTS: Original abstracts achieved a mean OQS of 11.89 (95% CI: 11.23-12.54), outperforming GPT 3.5 (7.89; 95% CI: 7.32-8.46) and GPT 4 (5.18; 95% CI: 4.64-5.71). Compared with GPT 3.5 and GPT 4 outputs, original abstracts showed greater adherence to 10 and 14 CONSORT-A items, respectively. In blind assessments, GPT 3.5-generated abstracts were deemed most readable in 62.22% of cases, significantly more often than the original (31.11%; P = 0.003) and GPT 4-generated (6.67%; P < 0.001) abstracts. Moreover, ChatGPT 3.5 exhibited a hallucination rate of 0.03 items per abstract compared with 1.13 for GPT 4. No determinants of improved reporting quality were identified for GPT-generated abstracts.
CONCLUSIONS: While ChatGPT could generate more readable abstracts, their overall quality was inferior to that of the original abstracts. Yet its ability to relay key information concisely and with minimal error holds promise for medical research and warrants further investigation to fully ascertain the LLM's applicability in this domain.
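To illustrate the scoring described in METHODOLOGY, the sketch below shows one way the OQS and per-item percent adherence could be computed, assuming each of the 18 CONSORT-A checklist items is rated in binary fashion (adequately reported = 1, otherwise 0). The item labels, function names, and data are illustrative placeholders, not the study's actual instrument or code.

```python
# Minimal sketch of OQS and percent-adherence scoring, assuming binary ratings
# per CONSORT-A item. Item labels below are placeholders for the 18-item checklist.

CONSORT_A_ITEMS = [
    "title", "authors", "trial_design", "participants", "interventions",
    "objective", "outcome", "randomisation", "blinding", "numbers_randomised",
    "recruitment", "numbers_analysed", "outcome_result", "harms",
    "conclusions", "trial_registration", "funding", "methods_setting",
]  # 18 placeholder labels

def oqs(ratings: dict[str, int]) -> int:
    """Overall quality score: number of adequately reported items (0-18)."""
    return sum(ratings.get(item, 0) for item in CONSORT_A_ITEMS)

def percent_adherence(all_ratings: list[dict[str, int]]) -> dict[str, float]:
    """Per-item adherence across a set of rated abstracts, as a percentage."""
    n = len(all_ratings)
    return {
        item: 100 * sum(r.get(item, 0) for r in all_ratings) / n
        for item in CONSORT_A_ITEMS
    }

# Toy example: two rated abstracts
abstracts = [
    {item: 1 for item in CONSORT_A_ITEMS[:12]},  # 12 items reported -> OQS 12
    {item: 1 for item in CONSORT_A_ITEMS[:8]},   # 8 items reported  -> OQS 8
]
print([oqs(a) for a in abstracts])            # [12, 8]
print(percent_adherence(abstracts)["title"])  # 100.0
```

A mean OQS such as the reported 11.89 would then simply be the average of these per-abstract counts over the sample of rated abstracts.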