
GastroGPT: Development and controlled testing of a proof-of-concept customized clinical language model.

Author Information

Simsek Cem, Ucdal Mete, de-Madaria Enrique, Ebigbo Alanna, Vanek Petr, Elshaarawy Omar, Voiosu Theodor Alexandru, Antonelli Giulio, Turró Román, Gisbert Javier P, Nyssen Olga P, Hassan Cesare, Messmann Helmut, Jalan Rajiv

Affiliations

Gastroenterology & Hepatology, Johns Hopkins Medical Institutions Campus, Baltimore, United States.

Internal Medicine, Hacettepe University Faculty of Medicine, Ankara, Turkey.

Publication Information

Endosc Int Open. 2025 Aug 6;13:a26372163. doi: 10.1055/a-2637-2163. eCollection 2025.


DOI: 10.1055/a-2637-2163
PMID: 40860687
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12371664/
Abstract

BACKGROUND AND STUDY AIMS: Current general-purpose artificial intelligence (AI) large language models (LLMs) demonstrate limited efficacy in clinical medicine, often constrained to question-answering, documentation, and literature summarization roles. We developed GastroGPT, a proof-of-concept specialty-specific, multi-task, clinical LLM, and evaluated its performance against leading general-purpose LLMs across key gastroenterology tasks and diverse case scenarios. METHODS: In this structured analysis, GastroGPT was compared with three state-of-the-art general-purpose LLMs (LLM-A: GPT-4, LLM-B: Bard, LLM-C: Claude). Models were assessed on seven clinical tasks and overall performance across 10 simulated gastroenterology cases varying in complexity, frequency, and patient demographics. Standardized prompts facilitated structured comparisons. A blinded expert panel rated model outputs per task on a 10-point Likert scale, judging clinical utility. Comprehensive statistical analyses were conducted. RESULTS: A total of 2,240 expert ratings were obtained. GastroGPT achieved significantly higher mean overall scores (8.1 ± 1.8) compared with GPT-4 (5.2 ± 3.0), Bard (5.7 ± 3.3), and Claude (7.0 ± 2.7) (all P < 0.001). It outperformed comparators in six of seven tasks (P < 0.05), except follow-up planning. GastroGPT demonstrated superior score consistency (variance 34.95) versus the general models (97.4-260.35) (P < 0.001). Its performance remained consistent across case complexities and frequencies, unlike the comparators (P < 0.001). Multivariate analysis revealed that model type significantly predicted performance (P < 0.001). CONCLUSIONS: This study pioneered development and comparison of a specialty-specific, clinically oriented AI model to general-purpose LLMs. GastroGPT demonstrated superior utility overall and on key gastroenterology tasks, highlighting the potential for tailored, task-focused AI models in medicine.
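The abstract's summary statistics (mean ± SD per model, plus score variance as the consistency measure) follow directly from the raw Likert ratings. A minimal sketch with the standard library, using made-up illustrative ratings rather than the study's actual data:

```python
import statistics

# Hypothetical 10-point Likert ratings (illustrative only; not the study's data).
ratings = {
    "GastroGPT": [8, 9, 7, 8, 10, 8, 7, 9],
    "LLM-A":     [2, 9, 4, 8, 3, 7, 5, 4],
}

def summarize(scores):
    """Return (mean, sample SD, sample variance), the three figures
    the abstract reports for each model."""
    return (statistics.mean(scores),
            statistics.stdev(scores),
            statistics.variance(scores))

for model, scores in ratings.items():
    mean, sd, var = summarize(scores)
    print(f"{model}: mean={mean:.1f} ± {sd:.1f}, variance={var:.2f}")
```

A lower variance at a comparable mean is what the abstract means by "superior score consistency": the model's ratings cluster tightly rather than swinging between strong and weak outputs.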


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/25a6/12371664/7a96992653ea/10-1055-a-2637-2163_26389136.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/25a6/12371664/fc65f7198762/10-1055-a-2637-2163_26389131.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/25a6/12371664/127a5db7ec78/10-1055-a-2637-2163_26389132.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/25a6/12371664/cb9fb19ef480/10-1055-a-2637-2163_26389133.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/25a6/12371664/7b121ecd5d16/10-1055-a-2637-2163_26389134.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/25a6/12371664/1f88600af480/10-1055-a-2637-2163_26389135.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/25a6/12371664/7a96992653ea/10-1055-a-2637-2163_26389136.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/25a6/12371664/fc65f7198762/10-1055-a-2637-2163_26389131.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/25a6/12371664/127a5db7ec78/10-1055-a-2637-2163_26389132.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/25a6/12371664/cb9fb19ef480/10-1055-a-2637-2163_26389133.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/25a6/12371664/7b121ecd5d16/10-1055-a-2637-2163_26389134.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/25a6/12371664/1f88600af480/10-1055-a-2637-2163_26389135.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/25a6/12371664/7a96992653ea/10-1055-a-2637-2163_26389136.jpg

Similar Articles

[1]
GastroGPT: Development and controlled testing of a proof-of-concept customized clinical language model.

Endosc Int Open. 2025-8-6

[2]
Evaluating Bard Gemini Pro and GPT-4 Vision Against Student Performance in Medical Visual Question Answering: Comparative Case Study.

JMIR Form Res. 2024-12-17

[3]
Large Language Models and Empathy: Systematic Review.

J Med Internet Res. 2024-12-11

[4]
Clinical Management of Wasp Stings Using Large Language Models: Cross-Sectional Evaluation Study.

J Med Internet Res. 2025-6-4

[5]
Rapidly Benchmarking Large Language Models for Diagnosing Comorbid Patients: Comparative Study Leveraging the LLM-as-a-Judge Method.

JMIRx Med. 2025-8-29

[6]
Classifying Patient Complaints Using Artificial Intelligence-Powered Large Language Models: Cross-Sectional Study.

J Med Internet Res. 2025-8-6

[7]
Implementing Large Language Models in Health Care: Clinician-Focused Review With Interactive Guideline.

J Med Internet Res. 2025-7-11

[8]
Comparison of a Specialized Large Language Model with GPT-4o for CT and MRI Radiology Report Summarization.

Radiology. 2025-8

[9]
Advancing health coaching: A comparative study of large language model and health coaches.

Artif Intell Med. 2024-11

[10]
Development and Validation of a Large Language Model-Based System for Medical History-Taking Training: Prospective Multicase Study on Evaluation Stability, Human-AI Consistency, and Transparency.

JMIR Med Educ. 2025-8-29

