Assessing large language models as assistive tools in medical consultations for Kawasaki disease.

Authors

Yan Chunyi, Li Zexi, Liang Yongzhou, Shao Shuran, Ma Fan, Zhang Nanjun, Li Bowen, Wang Chuan, Zhou Kaiyu

Affiliations

Department of Pediatric Cardiology, West China Second University Hospital, Sichuan University, Chengdu, China.

Key Laboratory of Birth Defects and Related Diseases of Women and Children (Sichuan University), Ministry of Education, Chengdu, China.

Publication information

Front Artif Intell. 2025 Mar 31;8:1571503. doi: 10.3389/frai.2025.1571503. eCollection 2025.

DOI:10.3389/frai.2025.1571503
PMID:40231209
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11994668/
Abstract

BACKGROUND

Kawasaki disease (KD) presents complex clinical challenges in diagnosis, treatment, and long-term management, requiring a comprehensive understanding by both parents and healthcare providers. With advancements in artificial intelligence (AI), large language models (LLMs) have shown promise in supporting medical practice. This study aims to evaluate and compare the appropriateness and comprehensibility of different LLMs in answering clinically relevant questions about KD and assess the impact of different prompting strategies.

METHODS

Twenty-five questions were formulated, incorporating three prompting strategies: No prompting (NO), Parent-friendly (PF), and Doctor-level (DL). These questions were input into three LLMs: ChatGPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro. Responses were evaluated based on appropriateness, educational quality, comprehensibility, cautionary statements, references, and potential misinformation, using Information Quality Grade, Global Quality Scale (GQS), Flesch Reading Ease (FRE) score, and word count.
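For readers unfamiliar with the readability metric named above, the standard Flesch Reading Ease formula is 206.835 − 1.015 × (words / sentences) − 84.6 × (syllables / words), where higher scores indicate easier text. The following is a minimal sketch of how the FRE score and word count might be computed for a single response. It is not the authors' tooling: the function names `count_syllables` and `flesch_reading_ease` are hypothetical, and the syllable counter is a rough vowel-group heuristic that will differ from dictionary-based tools.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption): count vowel groups, drop a silent final 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith(("le", "ee")) and count > 1:
        count -= 1
    return max(count, 1)


def flesch_reading_ease(text: str) -> tuple[float, int]:
    """Return (FRE score, word count) for an English text."""
    # Split on sentence-ending punctuation and ignore empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    score = (
        206.835
        - 1.015 * (len(words) / len(sentences))
        - 84.6 * (syllables / len(words))
    )
    return score, len(words)


if __name__ == "__main__":
    score, n_words = flesch_reading_ease("The cat sat on the mat.")
    print(f"FRE = {score:.1f}, words = {n_words}")
```

Short sentences of one-syllable words score above 100 under this formula, while dense clinical prose scores far lower, so the choice of syllable counter can shift absolute values noticeably.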

RESULTS

Significant differences were found among the LLMs in terms of response educational quality, accuracy, and comprehensibility (p < 0.001). Claude 3.5 provided the highest proportion of completely correct responses (51.1%) and achieved the highest median GQS score (5.0), outperforming GPT-4o (4.0) and Gemini 1.5 (3.0) significantly. Gemini 1.5 achieved the highest FRE score (31.5) and provided the highest proportion of responses assessed as comprehensible (80.4%). Prompting strategies significantly affected LLM responses. Claude 3.5 Sonnet with DL prompting had the highest completely correct rate (81.3%), while PF prompting yielded the most acceptable responses (97.3%). Gemini 1.5 Pro showed minimal variation across prompts but excelled in comprehensibility (98.7% under PF prompting).

CONCLUSION

This study indicates that LLMs have great potential in providing information about KD, but their use requires caution due to quality inconsistencies and misinformation risks. Significant discrepancies existed across LLMs and prompting strategies. Claude 3.5 Sonnet offered the best response quality and accuracy, while Gemini 1.5 Pro excelled in comprehensibility. PF prompting with Claude 3.5 Sonnet is most recommended for parents seeking KD information. As AI evolves, expanding research and refining models is crucial to ensure reliable, high-quality information.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b2c5/11994668/0604410f2447/frai-08-1571503-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b2c5/11994668/c914a4d14a1d/frai-08-1571503-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b2c5/11994668/6f7a8f243c86/frai-08-1571503-g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b2c5/11994668/ec94ce5aa600/frai-08-1571503-g004.jpg
