The performance of artificial intelligence large language model-linked chatbots in surgical decision-making for gastroesophageal reflux disease.
Author information
Division of General Surgery, Department of Surgery, McMaster University, Hamilton, ON, Canada.
University of California South California, East Bay, Oakland, CA, USA.
Publication information
Surg Endosc. 2024 May;38(5):2320-2330. doi: 10.1007/s00464-024-10807-w. Epub 2024 Apr 17.
BACKGROUND
Large language model (LLM)-linked chatbots may be an efficient source of clinical recommendations for healthcare providers and patients. This study evaluated the performance of LLM-linked chatbots in providing recommendations for the surgical management of gastroesophageal reflux disease (GERD).
METHODS
Nine patient cases were created based on key questions (KQs) addressed by the Society of American Gastrointestinal and Endoscopic Surgeons (SAGES) guidelines for the surgical treatment of GERD. ChatGPT-3.5, ChatGPT-4, Copilot, Google Bard, and Perplexity AI were queried on November 16th, 2023, for recommendations regarding the surgical management of GERD. Chatbot accuracy was defined as the number of responses aligning with SAGES guideline recommendations. Outcomes were reported as counts and percentages.
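The scoring described above reduces to tallying, for each chatbot and audience, how many KQ responses align with the SAGES recommendations and expressing that as a percentage. A minimal sketch of that tally, using illustrative placeholder judgments rather than the study's actual scoring data:

```python
# Hedged sketch of the count-and-percentage scoring described in METHODS.
# The alignment judgments below are illustrative placeholders, not study data.

def accuracy(judgments):
    """Return (aligned count, percent) for a list of per-KQ alignment booleans."""
    aligned = sum(1 for j in judgments if j)
    return aligned, round(100 * aligned / len(judgments), 1)

# Hypothetical example: a chatbot judged accurate on 5 of 7 adult-surgeon KQs
adult_surgeon_kqs = [True, True, False, True, True, False, True]
print(accuracy(adult_surgeon_kqs))  # (5, 71.4)
```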
RESULTS
According to the SAGES guidelines, surgeons received accurate recommendations for the surgical management of GERD in an adult patient for 5/7 (71.4%) KQs from ChatGPT-4, 3/7 (42.9%) KQs from Copilot, 6/7 (85.7%) KQs from Google Bard, and 3/7 (42.9%) KQs from Perplexity. Patients received accurate recommendations for 3/5 (60.0%) KQs from ChatGPT-4, 2/5 (40.0%) KQs from Copilot, 4/5 (80.0%) KQs from Google Bard, and 1/5 (20.0%) KQs from Perplexity. For a pediatric patient, surgeons received accurate recommendations for 2/3 (66.7%) KQs from ChatGPT-4, 3/3 (100.0%) KQs from Copilot, 3/3 (100.0%) KQs from Google Bard, and 2/3 (66.7%) KQs from Perplexity. Patients received appropriate guidance for 2/2 (100.0%) KQs from ChatGPT-4, 2/2 (100.0%) KQs from Copilot, 1/2 (50.0%) KQs from Google Bard, and 1/2 (50.0%) KQs from Perplexity.
CONCLUSIONS
Gastrointestinal surgeons, gastroenterologists, and patients should recognize both the promise and the pitfalls of LLMs when used for advice on the surgical management of GERD. Additional training of LLMs on evidence-based health information is needed.