Simsek Cem, Ucdal Mete, de-Madaria Enrique, Ebigbo Alanna, Vanek Petr, Elshaarawy Omar, Voiosu Theodor Alexandru, Antonelli Giulio, Turró Román, Gisbert Javier P, Nyssen Olga P, Hassan Cesare, Messmann Helmut, Jalan Rajiv
Gastroenterology & Hepatology, Johns Hopkins Medical Institutions Campus, Baltimore, United States.
Internal Medicine, Hacettepe University Faculty of Medicine, Ankara, Turkey.
Endosc Int Open. 2025 Aug 6;13:a26372163. doi: 10.1055/a-2637-2163. eCollection 2025.
BACKGROUND AND STUDY AIMS: Current general-purpose artificial intelligence (AI) large language models (LLMs) demonstrate limited efficacy in clinical medicine, often constrained to question-answering, documentation, and literature-summarization roles. We developed GastroGPT, a proof-of-concept specialty-specific, multi-task clinical LLM, and evaluated its performance against leading general-purpose LLMs across key gastroenterology tasks and diverse case scenarios. METHODS: In this structured analysis, GastroGPT was compared with three state-of-the-art general-purpose LLMs (LLM-A: GPT-4, LLM-B: Bard, LLM-C: Claude). Models were assessed on seven clinical tasks and on overall performance across 10 simulated gastroenterology cases varying in complexity, frequency, and patient demographics. Standardized prompts facilitated structured comparisons. A blinded expert panel rated model outputs per task on a 10-point Likert scale, judging clinical utility. Comprehensive statistical analyses were conducted. RESULTS: A total of 2,240 expert ratings were obtained. GastroGPT achieved a significantly higher mean overall score (8.1 ± 1.8) than GPT-4 (5.2 ± 3.0), Bard (5.7 ± 3.3), and Claude (7.0 ± 2.7) (all P < 0.001). It outperformed the comparators in six of seven tasks (P < 0.05), the exception being follow-up planning. GastroGPT demonstrated superior score consistency (variance 34.95) versus the general-purpose models (97.4-260.35) (P < 0.001). Its performance remained consistent across case complexities and frequencies, unlike the comparators (P < 0.001). Multivariate analysis revealed that model type significantly predicted performance (P < 0.001). CONCLUSIONS: This study pioneered the development and comparison of a specialty-specific, clinically oriented AI model against general-purpose LLMs. GastroGPT demonstrated superior utility overall and on key gastroenterology tasks, highlighting the potential of tailored, task-focused AI models in medicine.
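The summary statistics reported above (mean ± SD per model, and score variance as a consistency measure) can be sketched as follows. This is a minimal illustration only: the rating lists below are invented placeholders, not the study's 2,240 expert ratings, and the study's full analysis (significance testing, multivariate modeling) is not reproduced here.

```python
# Hypothetical sketch of the abstract's descriptive statistics:
# mean +/- sample SD of expert Likert ratings per model, plus the
# population variance of scores as a rough consistency measure.
# The ratings below are illustrative placeholders, NOT study data.
from statistics import mean, stdev, pvariance

ratings = {
    "GastroGPT": [8, 9, 7, 8, 10, 8, 7, 9],
    "GPT-4":     [3, 8, 5, 2, 9, 4, 6, 5],
}

def summarize(scores):
    """Return (mean, sample SD, population variance) for a rating list."""
    return mean(scores), stdev(scores), pvariance(scores)

for model, scores in ratings.items():
    m, sd, var = summarize(scores)
    print(f"{model}: {m:.1f} ± {sd:.1f} (variance {var:.2f})")
```

A lower variance, as reported for GastroGPT (34.95 vs. 97.4-260.35), indicates that expert ratings clustered more tightly around the model's mean score across tasks and cases.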