
Evaluating performance of large language models for atrial fibrillation management using different prompting strategies and languages.

Authors

Li Zexi, Yan Chunyi, Cao Ying, Gong Aobo, Li Fanghui, Zeng Rui

Affiliations

Department of Cardiology, West China Hospital, Sichuan University, Chengdu, 610041, Sichuan, China.

Department of Pediatric Cardiology, West China Second University Hospital, Sichuan University, Chengdu, 610041, Sichuan, China.

Publication information

Sci Rep. 2025 May 30;15(1):19028. doi: 10.1038/s41598-025-04309-5.

DOI:10.1038/s41598-025-04309-5
PMID:40447746
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12125184/
Abstract

This study evaluated large language models (LLMs) using 30 questions, each derived from a recommendation in the 2024 European Society of Cardiology (ESC) guidelines for atrial fibrillation (AF) management. These recommendations were stratified by class of recommendation and level of evidence. The primary objective was to assess the reliability and consistency of LLM-generated classifications compared to those in the ESC guidelines. Additionally, the study assessed the impact of different prompting strategies and working languages on LLM performance. Three prompting strategies were tested: Input-output (IO), 0-shot-Chain of thought (0-COT) and Performed-Chain of thought (P-COT) prompting. Each question, presented in both English and Chinese, was input into three LLMs: ChatGPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro. The reliability of the different LLM-prompt combinations showed moderate to substantial agreement (Fleiss kappa ranged from 0.449 to 0.763). Claude 3.5 with P-COT prompting had the highest recommendation classification consistency (60.3%). No significant differences were observed between English and Chinese across most LLM-prompt combinations. Bias analysis of inconsistent outcomes revealed a propensity towards more recommended treatments and stronger evidence levels across most LLM-prompt combinations. The characteristics of clinical questions potentially influence LLM performance. This study highlights the limitations in the accuracy of LLM responses to AF-related questions. To gather more comprehensive insights, conducting repeated queries is advisable. Future efforts should focus on expanding the use of diverse prompting strategies, conducting ongoing model evaluation and refinement, and establishing a comprehensive, objective benchmarking system.
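
The two metrics named in the abstract, Fleiss' kappa for reliability and a consistency rate against the guideline, can be illustrated with a minimal sketch. It assumes each question is queried the same number of times and that each repeated response is treated as one "rater"; the abstract does not state the number of repeats or the exact consistency formula, so the function names, the toy data, and the definition of consistency as the share of responses matching the guideline class are all illustrative assumptions, not the authors' method.

```python
from collections import Counter

# ESC classes of recommendation, used here as the label set.
CLASSES = ["I", "IIa", "IIb", "III"]


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for a list of items, each rated the same number of times.

    ratings: list of lists, one inner list of labels per question.
    categories: all labels that may appear.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(r) != n_raters for r in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    # counts[i][j] = how many of the repeated answers for item i chose category j
    counts = [[Counter(r)[c] for c in categories] for r in ratings]

    # Observed agreement: mean over items of the pairwise agreement P_i
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_items) / n_items

    # Expected agreement from the overall category proportions
    total = n_items * n_raters
    p_cat = [sum(row[j] for row in counts) / total for j in range(len(categories))]
    p_expected = sum(p * p for p in p_cat)

    if p_expected == 1.0:
        # Degenerate case: every answer falls in one category
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)


def consistency_rate(ratings, reference):
    """Share of all responses whose label equals the guideline label (assumed definition)."""
    total = sum(len(r) for r in ratings)
    hits = sum(label == ref for r, ref in zip(ratings, reference) for label in r)
    return hits / total


# Toy data (hypothetical): 4 questions, 3 repeated queries each.
example_ratings = [
    ["I", "I", "I"],
    ["IIa", "IIa", "I"],
    ["III", "III", "III"],
    ["IIb", "IIa", "IIb"],
]
example_reference = ["I", "IIa", "III", "IIb"]

if __name__ == "__main__":
    print("kappa:", round(fleiss_kappa(example_ratings, CLASSES), 3))
    print("consistency:", round(consistency_rate(example_ratings, example_reference), 3))
```

Kappa measures how often repeated answers agree with one another beyond chance, while the consistency rate measures agreement with the guideline, so a model can score well on one and poorly on the other.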


Fig. 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b83f/12125184/413b91c9b276/41598_2025_4309_Fig1_HTML.jpg
Fig. 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b83f/12125184/6fc4c522bf70/41598_2025_4309_Fig2_HTML.jpg
Fig. 3: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b83f/12125184/6810d4f5bfbf/41598_2025_4309_Fig3_HTML.jpg
Fig. 4: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b83f/12125184/f874a0b8cd88/41598_2025_4309_Fig4_HTML.jpg

