

Designing Patient-Centered Communication Aids in Pediatric Surgery Using Large Language Models.

Authors

Rao Arya S, Mazumder Aneesh, Roux Elizabeth, Young Cameron, Bott Ethan, Wang Julie, Kochis Michael, Stetson Alyssa, Butler Alex, Hilker Sidney, Succi Marc D

Affiliations

Harvard Medical School, Boston, MA, United States; Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center, Mass General Brigham, Boston, MA, United States.

Harvard College, Harvard University, Cambridge, MA, United States.

Publication

J Pediatr Surg. 2025 Sep 8:162654. doi: 10.1016/j.jpedsurg.2025.162654.

DOI: 10.1016/j.jpedsurg.2025.162654
PMID: 40930385
Abstract

INTRODUCTION

Large language models (LLMs) have been shown to translate information from highly specific domains into lay-digestible terms. Pediatric surgery remains an area in which it is difficult to communicate clinical information in an age-appropriate manner, given the vast diversity in language comprehension levels across patient populations and the complexity of procedures performed. This study evaluates LLMs as tools for generating explanations of common pediatric surgeries to increase efficiency and quality of communication.

METHODS

Two generalist LLMs (GPT-4-turbo [OpenAI] and Gemini 1.0 Pro [Google]; accessed March 2024) were provided the following prompt: "Act as a pediatric surgeon and explain a [PROCEDURE] to a [AGE] old [GENDER] in age-appropriate language. Discuss indications for the procedure, steps of the procedure, possible complications, and post-operative recovery." Responses were generated for 4 common pediatric surgeries (appendectomy, umbilical hernia repair, cholecystectomy, and gastrostomy tube placement) for male and female children of ages 5, 8, 10, 13, and 16 years. Forty responses from each LLM were rated for accuracy, completeness, age-appropriateness, possibility of demographic bias, and overall quality by two pediatricians and two general surgeons using a five-point Likert scale. Numeric ratings were summarized as means and 95% confidence intervals. An ordinal mixed-effects model with rater as a random effect was used to account for clustering by rater. P<0.05 was considered statistically significant.
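The full factorial design described above (4 procedures × 5 ages × 2 genders = 40 prompts per model) can be sketched as a simple prompt grid. This is a minimal illustration, not the study's actual code: the template text is quoted from the prompt above, while the variable names and enumeration are assumptions.

```python
from itertools import product

# Prompt template quoted from the study's Methods section.
TEMPLATE = (
    "Act as a pediatric surgeon and explain a {procedure} to a "
    "{age} year old {gender} in age-appropriate language. Discuss "
    "indications for the procedure, steps of the procedure, possible "
    "complications, and post-operative recovery."
)

procedures = [
    "appendectomy",
    "umbilical hernia repair",
    "cholecystectomy",
    "gastrostomy tube placement",
]
ages = [5, 8, 10, 13, 16]
genders = ["male", "female"]

# Full factorial design: 4 procedures x 5 ages x 2 genders = 40 prompts,
# each of which would be sent to both models.
prompts = [
    TEMPLATE.format(procedure=p, age=a, gender=g)
    for p, a, g in product(procedures, ages, genders)
]

print(len(prompts))  # 40
```

Each of the 40 prompts is then submitted once to each model, yielding the 40 responses per LLM that the four clinician raters scored.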

RESULTS

Responses from GPT-4-turbo and Gemini 1.0 Pro models were both rated with moderately high overall quality (GPT4: 3.97 [3.82, 4.12]; Gemini 1.0 Pro: 3.39 [3.20, 3.57]) and moderately low possibility of demographic bias (GPT4: 2.49 [2.38, 2.60]; Gemini 1.0 Pro: 2.93 [2.79, 3.07]). GPT-4-turbo responses were rated as highly accurate (4.18 [4.05, 4.32]), highly complete (4.21 [4.10, 4.33]), and highly age-appropriate (4.10 [3.96, 4.24]), while Gemini 1.0 Pro responses were rated as moderately accurate (3.83 [3.70, 3.96]), moderately complete (3.95 [3.83, 4.07]), and moderately age-appropriate (3.63 [3.47, 3.79]). With GPT-4-turbo, ratings on most measures tended to improve as patient age increased, whereas with Gemini 1.0 Pro, they tended to worsen as patient age increased. Ratings on all measures, with the exception of age-appropriateness, were slightly higher for responses generated for male patients than for female patients with GPT-4-turbo, while the gender differences were less pronounced with Gemini 1.0 Pro.
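The summaries reported above take the form "mean [lower, upper]", i.e. a mean Likert rating with a 95% confidence interval. A minimal sketch of that computation, using a standard normal-approximation interval, is below; the example ratings are made-up for illustration and are not data from the study.

```python
import math

def mean_ci(ratings, z=1.96):
    """Mean and normal-approximation 95% CI for a list of Likert ratings."""
    n = len(ratings)
    mean = sum(ratings) / n
    # Sample variance (n - 1 denominator), then standard error of the mean.
    var = sum((x - mean) ** 2 for x in ratings) / (n - 1)
    half_width = z * math.sqrt(var / n)
    return mean, mean - half_width, mean + half_width

# Illustrative ratings only (1-5 Likert scale), not study data.
example = [4, 5, 4, 3, 4, 5, 4, 4]
m, lo, hi = mean_ci(example)
print(f"{m:.2f} [{lo:.2f}, {hi:.2f}]")
```

The study's intervals may have been computed differently (e.g. from the ordinal mixed-effects model rather than raw means), so this is only a sketch of the reported summary format.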

DISCUSSION

This study demonstrates that off-the-shelf LLMs have the potential to produce accurate, complete, and age-appropriate explanations of common pediatric surgeries with a low possibility of demographic bias. Inter-model variability in areas such as response quality, age-appropriateness, and gender differences was also observed, signaling the need for additional validation and fine-tuning based on the clinical content. Such tools could be implemented at the point of care or in other patient education settings and personalized to ensure effective, equitable communication of pertinent medical information, with clinician-rated content quality as demonstrated here.

STUDY TYPE

This is a pilot study evaluating the performance of large language models (LLMs) as patient-centered communication aids in pediatric surgery.

LEVEL OF EVIDENCE

Level IV (pilot study).


Similar Articles

1. Designing Patient-Centered Communication Aids in Pediatric Surgery Using Large Language Models.
J Pediatr Surg. 2025 Sep 8:162654. doi: 10.1016/j.jpedsurg.2025.162654.
2. Prescription of Controlled Substances: Benefits and Risks.
3. [Volume and health outcomes: evidence from systematic reviews and from evaluation of Italian hospital data].
Epidemiol Prev. 2013 Mar-Jun;37(2-3 Suppl 2):1-100.
4. Evaluation of Large Language Models in Tailoring Educational Content for Cancer Survivors and Their Caregivers: Quality Analysis.
JMIR Cancer. 2025 Apr 7;11:e67914. doi: 10.2196/67914.
5. Signs and symptoms to determine if a patient presenting in primary care or hospital outpatient settings has COVID-19.
Cochrane Database Syst Rev. 2022 May 20;5(5):CD013665. doi: 10.1002/14651858.CD013665.pub3.
6. Sexual Harassment and Prevention Training.
7. Comparison of Two Modern Survival Prediction Tools, SORG-MLA and METSSS, in Patients With Symptomatic Long-bone Metastases Who Underwent Local Treatment With Surgery Followed by Radiotherapy and With Radiotherapy Alone.
Clin Orthop Relat Res. 2024 Dec 1;482(12):2193-2208. doi: 10.1097/CORR.0000000000003185. Epub 2024 Jul 23.
8. Evaluating Bard Gemini Pro and GPT-4 Vision Against Student Performance in Medical Visual Question Answering: Comparative Case Study.
JMIR Form Res. 2024 Dec 17;8:e57592. doi: 10.2196/57592.
9. How Accurate Is AI? A Critical Evaluation of Commonly Used Large Language Models in Responding to Patient Concerns About Incidental Kidney Tumors.
J Clin Med. 2025 Aug 12;14(16):5697. doi: 10.3390/jcm14165697.
10. Rapidly Benchmarking Large Language Models for Diagnosing Comorbid Patients: Comparative Study Leveraging the LLM-as-a-Judge Method.
JMIRx Med. 2025 Aug 29;6:e67661. doi: 10.2196/67661.