


Comparison of Medical Research Abstracts Written by Surgical Trainees and Senior Surgeons or Generated by Large Language Models.

Affiliations

Division of Gastrointestinal and Minimally Invasive Surgery, Department of Surgery, Atrium Health Carolinas Medical Center, Charlotte, North Carolina.

Department of Economics, Massachusetts Institute of Technology, Cambridge.

Publication Information

JAMA Netw Open. 2024 Aug 1;7(8):e2425373. doi: 10.1001/jamanetworkopen.2024.25373.

DOI:10.1001/jamanetworkopen.2024.25373
PMID:39093561
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11297395/
Abstract

IMPORTANCE

Artificial intelligence (AI) has permeated academia, especially OpenAI Chat Generative Pretrained Transformer (ChatGPT), a large language model. However, little has been reported on its use in medical research.

OBJECTIVE

To assess a chatbot's capability to generate and grade medical research abstracts.

DESIGN, SETTING, AND PARTICIPANTS

In this cross-sectional study, ChatGPT versions 3.5 and 4.0 (referred to as chatbot 1 and chatbot 2) were coached to generate 10 abstracts by providing background literature, prompts, analyzed data for each topic, and 10 previously presented, unassociated abstracts to serve as models. The study was conducted between August 2023 and February 2024 (including data analysis).

EXPOSURE

Abstract versions utilizing the same topic and data were written by a surgical trainee or a senior physician or generated by chatbot 1 and chatbot 2 for comparison. The 10 training abstracts were written by 8 surgical residents or fellows, edited by the same senior surgeon, at a high-volume hospital in the Southeastern US with an emphasis on outcomes-based research. Abstract comparison was then based on 10 abstracts written by 5 surgical trainees within the first 6 months of their research year, edited by the same senior author.

MAIN OUTCOMES AND MEASURES

The primary outcome measurements were the abstract grades using 10- and 20-point scales and ranks (first to fourth). Abstract versions by chatbot 1, chatbot 2, junior residents, and the senior author were compared and judged by blinded surgeon-reviewers as well as both chatbot models. Five academic attending surgeons from Denmark, the UK, and the US, with extensive experience in surgical organizations, research, and abstract evaluation served as reviewers.

RESULTS

Surgeon-reviewers were unable to differentiate between abstract versions. Each reviewer ranked an AI-generated version first at least once. Abstracts demonstrated no difference in their median (IQR) 10-point scores (resident, 7.0 [6.0-8.0]; senior author, 7.0 [6.0-8.0]; chatbot 1, 7.0 [6.0-8.0]; chatbot 2, 7.0 [6.0-8.0]; P = .61), 20-point scores (resident, 14.0 [12.0-17.0]; senior author, 15.0 [13.0-17.0]; chatbot 1, 14.0 [12.0-16.0]; chatbot 2, 14.0 [13.0-16.0]; P = .50), or rank (resident, 3.0 [1.0-4.0]; senior author, 2.0 [1.0-4.0]; chatbot 1, 3.0 [2.0-4.0]; chatbot 2, 2.0 [1.0-3.0]; P = .14). The abstract grades given by chatbot 1 were comparable to the surgeon-reviewers' grades. However, chatbot 2 graded more favorably than the surgeon-reviewers and chatbot 1. Median (IQR) chatbot 2-reviewer grades were higher than surgeon-reviewer grades of all 4 abstract versions (resident, 14.0 [12.0-17.0] vs 16.9 [16.0-17.5]; P = .02; senior author, 15.0 [13.0-17.0] vs 17.0 [16.5-18.0]; P = .03; chatbot 1, 14.0 [12.0-16.0] vs 17.8 [17.5-18.5]; P = .002; chatbot 2, 14.0 [13.0-16.0] vs 16.8 [14.5-18.0]; P = .04). When comparing the grades of the 2 chatbots, chatbot 2 gave higher median (IQR) grades for abstracts than chatbot 1 (resident, 14.0 [13.0-15.0] vs 16.9 [16.0-17.5]; P = .003; senior author, 13.5 [13.0-15.5] vs 17.0 [16.5-18.0]; P = .004; chatbot 1, 14.5 [13.0-15.0] vs 17.8 [17.5-18.5]; P = .003; chatbot 2, 14.0 [13.0-15.0] vs 16.8 [14.5-18.0]; P = .01).

CONCLUSIONS AND RELEVANCE

In this cross-sectional study, trained chatbots generated convincing medical abstracts, undifferentiable from resident or senior author drafts. Chatbot 1 graded abstracts similarly to surgeon-reviewers, while chatbot 2 was less stringent. These findings may assist surgeon-scientists in successfully implementing AI in medical research.
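The score comparisons above are reported as median (IQR) summaries across reviewers. A minimal sketch of how such summaries are computed, using hypothetical 10-point scores (the abstract does not report the raw per-reviewer data, and the illustrative values here are chosen only so the output matches the reported 7.0 [6.0-8.0] pattern):

```python
from statistics import median

# Hypothetical 10-point scores for two abstract versions; NOT the study's
# raw data, which the abstract does not report.
resident = [6, 7, 7, 8, 6, 7, 8, 7, 6, 8]
chatbot1 = [7, 6, 8, 7, 7, 6, 8, 7, 8, 6]

def iqr(scores):
    """Return (Q1, Q3) by taking the medians of the lower and upper halves."""
    s = sorted(scores)
    n = len(s)
    lower, upper = s[: n // 2], s[(n + 1) // 2 :]
    return median(lower), median(upper)

for name, scores in [("resident", resident), ("chatbot 1", chatbot1)]:
    q1, q3 = iqr(scores)
    print(f"{name}: median {median(scores):.1f} [IQR {q1:.1f}-{q3:.1f}]")
```

Note that the abstract does not name the statistical test behind its P values; for four independent groups of ordinal scores like these, a nonparametric test such as Kruskal-Wallis would be a typical choice.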


Figure: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9913/11297395/2ad4608fc9ef/jamanetwopen-e2425373-g001.jpg


