Rao Arya S, Mazumder Aneesh, Roux Elizabeth, Young Cameron, Bott Ethan, Wang Julie, Kochis Michael, Stetson Alyssa, Butler Alex, Hilker Sidney, Succi Marc D
Harvard Medical School, Boston, MA, United States; Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center, Mass General Brigham, Boston, MA, United States.
Harvard College, Harvard University, Cambridge, MA, United States.
J Pediatr Surg. 2025 Sep 8:162654. doi: 10.1016/j.jpedsurg.2025.162654.
Large language models (LLMs) have been shown to translate information from highly specific domains into lay-digestible terms. Pediatric surgery remains an area in which it is difficult to communicate clinical information in an age-appropriate manner, given the vast diversity in language comprehension levels across patient populations and the complexity of procedures performed. This study evaluates LLMs as tools for generating explanations of common pediatric surgeries to increase efficiency and quality of communication.
Two generalist LLMs (GPT-4-turbo [OpenAI] and Gemini 1.0 Pro [Google]; accessed March 2024) were provided the following prompt: "Act as a pediatric surgeon and explain a [PROCEDURE] to a [AGE] old [GENDER] in age-appropriate language. Discuss indications for the procedure, steps of the procedure, possible complications, and post-operative recovery." Responses were generated for 4 common pediatric surgeries (appendectomy, umbilical hernia repair, cholecystectomy, and gastrostomy tube placement) for male and female children of ages 5, 8, 10, 13, and 16 years. Forty responses from each LLM were rated for accuracy, completeness, age-appropriateness, possibility of demographic bias, and overall quality by two pediatricians and two general surgeons using a five-point Likert scale. Numeric ratings were summarized as means and 95% confidence intervals. An ordinal mixed-effects model with rater as a random effect was used to account for clustering by rater. P<0.05 was considered statistically significant.
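The prompt grid described above (4 procedures × 5 ages × 2 genders = 40 prompts per model) can be sketched as follows. This is an illustrative reconstruction, not the authors' code; the variable names, the itertools-based enumeration, and the exact fill format for the [AGE] placeholder (rendered here as "{n}-year") are assumptions.

```python
# Sketch of the prompt grid: 4 procedures x 5 ages x 2 genders
# yields the 40 prompts per model described in the study design.
from itertools import product

PROMPT_TEMPLATE = (
    "Act as a pediatric surgeon and explain a {procedure} to a "
    "{age} old {gender} in age-appropriate language. Discuss "
    "indications for the procedure, steps of the procedure, possible "
    "complications, and post-operative recovery."
)

procedures = [
    "appendectomy",
    "umbilical hernia repair",
    "cholecystectomy",
    "gastrostomy tube placement",
]
ages = [5, 8, 10, 13, 16]  # years
genders = ["male", "female"]

prompts = [
    # Fill format for the age slot is an assumption (e.g. "5-year old").
    PROMPT_TEMPLATE.format(procedure=p, age=f"{a}-year", gender=g)
    for p, a, g in product(procedures, ages, genders)
]

print(len(prompts))  # 40 prompts per model
```

Each of the 40 prompts would then be sent to both models, giving the 40 responses per LLM that the four clinician raters scored.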
Responses from GPT-4-turbo and Gemini 1.0 Pro models were both rated with moderately high overall quality (GPT4: 3.97 [3.82, 4.12]; Gemini 1.0 Pro: 3.39 [3.20, 3.57]) and moderately low possibility of demographic bias (GPT4: 2.49 [2.38, 2.60]; Gemini 1.0 Pro: 2.93 [2.79, 3.07]). GPT-4-turbo responses were rated as highly accurate (4.18 [4.05, 4.32]), highly complete (4.21 [4.10, 4.33]), and highly age-appropriate (4.10 [3.96, 4.24]), while Gemini 1.0 Pro responses were rated as moderately accurate (3.83 [3.70, 3.96]), moderately complete (3.95 [3.83, 4.07]), and moderately age-appropriate (3.63 [3.47, 3.79]). With GPT-4-turbo, ratings on most measures tended to improve as patient age increased, whereas with Gemini 1.0 Pro, they tended to worsen as patient age increased. With GPT-4-turbo, ratings on all measures except age-appropriateness were slightly higher for responses generated for male patients than for female patients; these gender differences were less pronounced with Gemini 1.0 Pro.
This study demonstrates that off-the-shelf LLMs have the potential to produce accurate, complete, and age-appropriate explanations of common pediatric surgeries with a low possibility of demographic bias. Inter-model variability in areas such as response quality, age-appropriateness, and gender differences was also observed, signaling the need for additional validation and fine-tuning based on clinical content. Such tools could be implemented at the point of care or in other patient education settings and personalized to ensure effective, equitable communication of pertinent medical information, with clinician-rated content quality demonstrated here.
This is a pilot study evaluating the performance of large language models (LLMs) as patient-centered communication aids in pediatric surgery.
Level IV (pilot study).