Sezgin Emre, Jackson Daniel I, Kocaballi A Baki, Bibart Mindy, Zupanec Sue, Landier Wendy, Audino Anthony, Ranalli Mark, Skeens Micah
The Abigail Wexner Research Institute at Nationwide Children's Hospital, Columbus, Ohio, USA.
The Ohio State University College of Medicine, Columbus, Ohio, USA.
Cancer Med. 2025 Jan;14(1):e70554. doi: 10.1002/cam4.70554.
Caregivers in pediatric oncology need accurate and understandable information about their child's condition, treatment, and side effects. This study assesses the performance of publicly accessible large language model (LLM)-supported tools in providing valuable and reliable information to caregivers of children with cancer.
In this cross-sectional study, we evaluated the performance of four LLM-supported tools, ChatGPT (GPT-4), Google Bard (Gemini Pro), Microsoft Bing Chat, and Google SGE, against a set of frequently asked questions (FAQs) derived from the Children's Oncology Group Family Handbook and expert input (26 FAQs in total, yielding 104 generated responses). Five pediatric oncology experts rated the generated responses on accuracy, clarity, inclusivity, completeness, clinical utility, and overall quality. Content quality was also evaluated, covering readability, AI disclosure, source credibility, resource matching, and content originality. We used descriptive analysis and statistical tests, including the Shapiro-Wilk test, Levene's test, Kruskal-Wallis H-tests, and Dunn's post hoc tests for pairwise comparisons.
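As a minimal sketch of the statistical pipeline described above, the snippet below applies the named tests with scipy and scikit-posthocs. The long-form pandas DataFrame layout and the column names "tool" and "score" are assumptions for illustration, not the study's actual code or data format.

```python
# Minimal sketch of the described analysis, assuming expert ratings are in a
# long-form DataFrame with hypothetical columns "tool" (ChatGPT, Bard,
# Bing Chat, Google SGE) and "score" (an expert rating for one response).
import pandas as pd
from scipy import stats
import scikit_posthocs as sp  # pip install scikit-posthocs

def compare_tools(df: pd.DataFrame) -> None:
    groups = [g["score"].to_numpy() for _, g in df.groupby("tool")]

    # Shapiro-Wilk: test each tool's score distribution for normality.
    for name, g in df.groupby("tool"):
        w, p = stats.shapiro(g["score"])
        print(f"Shapiro-Wilk {name}: W={w:.3f}, p={p:.4f}")

    # Levene's test: check homogeneity of variances across tools.
    stat, p = stats.levene(*groups)
    print(f"Levene: stat={stat:.3f}, p={p:.4f}")

    # Kruskal-Wallis H-test: nonparametric comparison across all four tools.
    h, p = stats.kruskal(*groups)
    print(f"Kruskal-Wallis: H={h:.3f}, p={p:.4f}")

    # Dunn's post hoc test with multiple-comparison correction for
    # pairwise tool-vs-tool contrasts.
    print(sp.posthoc_dunn(df, val_col="score", group_col="tool",
                          p_adjust="bonferroni"))
```

Kruskal-Wallis followed by Dunn's post hoc is the standard nonparametric route when Shapiro-Wilk and Levene's tests indicate that the normality and equal-variance assumptions of ANOVA do not hold, as is typical for Likert-type expert ratings.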
ChatGPT showed high overall performance in the expert evaluations. Bard also performed well, particularly in accuracy and clarity, whereas Bing Chat and Google SGE received lower overall scores. Disclosure that a response was AI-generated appeared least frequently in ChatGPT responses, which may have affected their clarity, whereas Bard balanced AI disclosure against response clarity. Google SGE generated the most readable responses, whereas ChatGPT produced the most complex ones. The LLM tools differed significantly (p < 0.001) on all expert evaluation measures except inclusivity. In a thematic analysis of the experts' free-text comments, emotional tone and empathy emerged as a distinct theme, with mixed views on whether AI should be expected to convey empathy.
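The abstract does not name the readability metric used to rank the tools; a common approach is a formula-based score such as Flesch Reading Ease or Flesch-Kincaid grade level, sketched below with the textstat package. The responses dict and its contents are placeholders, not the study's data.

```python
# A hedged sketch of one way to score response readability, assuming the
# textstat package and a hypothetical dict of responses keyed by tool name.
import textstat

responses = {
    "ChatGPT": "...",     # placeholder response text
    "Google SGE": "...",  # placeholder response text
}

for tool, text in responses.items():
    ease = textstat.flesch_reading_ease(text)    # higher = easier to read
    grade = textstat.flesch_kincaid_grade(text)  # approximate US grade level
    print(f"{tool}: reading ease={ease:.1f}, grade level={grade:.1f}")
```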
LLM-supported tools can enhance caregivers' knowledge of pediatric oncology. Each model has distinct strengths and areas for improvement, indicating the need for careful selection based on the specific clinical context. Further research is required to explore their application in other medical specialties and patient demographics and to assess broader applicability and long-term impact.