Yan Chunyi, Li Zexi, Liang Yongzhou, Shao Shuran, Ma Fan, Zhang Nanjun, Li Bowen, Wang Chuan, Zhou Kaiyu
Department of Pediatric Cardiology, West China Second University Hospital, Sichuan University, Chengdu, China.
Key Laboratory of Birth Defects and Related Diseases of Women and Children (Sichuan University), Ministry of Education, Chengdu, China.
Front Artif Intell. 2025 Mar 31;8:1571503. doi: 10.3389/frai.2025.1571503. eCollection 2025.
Kawasaki disease (KD) presents complex clinical challenges in diagnosis, treatment, and long-term management, requiring a comprehensive understanding by both parents and healthcare providers. With advancements in artificial intelligence (AI), large language models (LLMs) have shown promise in supporting medical practice. This study aims to evaluate and compare the appropriateness and comprehensibility of different LLMs in answering clinically relevant questions about KD and assess the impact of different prompting strategies.
Twenty-five questions were formulated, each posed under three prompting strategies: no prompting (NO), parent-friendly (PF), and doctor-level (DL). These questions were input into three LLMs: ChatGPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro. Responses were evaluated for appropriateness, educational quality, comprehensibility, cautionary statements, references, and potential misinformation, using the Information Quality Grade, the Global Quality Scale (GQS), the Flesch Reading Ease (FRE) score, and word count.
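The FRE score used here is a standard readability metric derived from word, sentence, and syllable counts. The sketch below shows the canonical formula; the vowel-run syllable counter is a rough heuristic for illustration, not the tooling used in the study:

```python
import re

def _syllables(word: str) -> int:
    # Rough heuristic: count runs of vowels; production tools
    # use pronunciation dictionaries instead.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fre_from_counts(words: int, sentences: int, syllables: int) -> float:
    """FRE = 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words)."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def flesch_reading_ease(text: str) -> float:
    """Score a text; higher means easier to read.

    60-70 corresponds to plain English; scores near 30, like
    Gemini 1.5 Pro's 31.5 reported here, indicate college-level text.
    """
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    return fre_from_counts(max(1, len(words)), sentences,
                           sum(_syllables(w) for w in words))
```

Because longer sentences and polysyllabic words both lower the score, medical answers written for clinicians tend to score far below answers written for parents, which is why the study pairs FRE with a comprehensibility rating.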
Significant differences were found among the LLMs in the educational quality, accuracy, and comprehensibility of responses (p < 0.001). Claude 3.5 Sonnet provided the highest proportion of completely correct responses (51.1%) and achieved the highest median GQS score (5.0), significantly outperforming GPT-4o (4.0) and Gemini 1.5 Pro (3.0). Gemini 1.5 Pro achieved the highest FRE score (31.5) and provided the highest proportion of responses assessed as comprehensible (80.4%). Prompting strategies significantly affected LLM responses. Claude 3.5 Sonnet with DL prompting had the highest completely correct rate (81.3%), while PF prompting yielded the highest proportion of acceptable responses (97.3%). Gemini 1.5 Pro showed minimal variation across prompts but excelled in comprehensibility (98.7% under PF prompting).
This study indicates that LLMs have great potential for providing information about KD, but their use requires caution owing to quality inconsistencies and misinformation risks. Significant discrepancies existed across LLMs and prompting strategies. Claude 3.5 Sonnet offered the best response quality and accuracy, while Gemini 1.5 Pro excelled in comprehensibility. PF prompting with Claude 3.5 Sonnet is most recommended for parents seeking KD information. As AI evolves, expanding research and refining models are crucial to ensuring reliable, high-quality information.