Battle of the Bots: Solving Clinical Cases in Osteoarticular Infections With Large Language Models.

Authors

Borgonovo Fabio, Matsuo Takahiro, Petri Francesco, Amin Alavi Seyed Mohammad, Mazudie Ndjonko Laura Chelsea, Gori Andrea, Berbari Elie F

Affiliations

Division of Public Health, Infectious Diseases and Occupational Medicine, Department of Medicine, Mayo Clinic College of Medicine and Science, Mayo Clinic, Rochester, MN.

Department of Infectious Diseases, "Luigi Sacco" University Hospital, Milan, Italy.

Publication information

Mayo Clin Proc Digit Health. 2025 May 23;3(3):100230. doi: 10.1016/j.mcpdig.2025.100230. eCollection 2025 Sep.

DOI:10.1016/j.mcpdig.2025.100230
PMID:40583928
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12205795/
Abstract

OBJECTIVE

To evaluate the ability of 15 different large language models (LLMs) to solve clinical cases with osteoarticular infections following published guidelines.

MATERIALS AND METHODS

The study evaluated 15 LLMs across 5 categories of osteoarticular infections: periprosthetic joint infection, diabetic foot infection, native vertebral osteomyelitis, fracture-related infections, and septic arthritis. Models were selected systematically, including general-purpose and medical-specific systems, ensuring robust English support. In total, 126 text-based questions, developed by the authors from published guidelines and validated by experts, assessed diagnostic, management, and treatment strategies. Each model answered individually, with responses classified as correct or incorrect based on guidelines. All tests were conducted between April 17, 2025, and April 28, 2025. Results, presented as percentages of correct answers and aggregated scores, highlight performance trends. Mixed-effects logistic regression with a random question effect was used to quantify how each LLM compared in answering the study questions.
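The abstract does not spell out the regression specification. One plausible reading of "mixed-effects logistic regression with a random question effect" is sketched below. The symbols (Y_{ij}, beta_0, beta_j, u_i, sigma_u) are assumptions introduced here for illustration, not notation taken from the paper.

```latex
% Assumed notation: question i = 1..126, model j = 1..15,
% Y_{ij} = 1 if model j answered question i correctly, 0 otherwise.
\[
  \operatorname{logit}\,\Pr(Y_{ij} = 1 \mid u_i) = \beta_0 + \beta_j + u_i,
  \qquad u_i \sim \mathcal{N}(0, \sigma_u^2),
\]
% with \beta_j = 0 for the reference model, so that
\[
  \mathrm{OR}_j = \exp(\beta_j)
\]
% is the odds of a correct answer for model j relative to the reference,
% conditional on the question.
```

Under this reading, the random intercept u_i absorbs question difficulty, so models are compared on the same questions rather than on pooled totals alone.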

RESULTS

The performance of 15 LLMs was evaluated, with the percentage of correct answers reported. OpenEvidence and Microsoft Copilot achieved the highest score (119/126 [94.4%]), excelling in multiple categories. ChatGPT-4o and Gemini 2.5 Pro each scored 117 of 126 (92.8%). When used as the reference, OpenEvidence was not inferior to any comparator and was superior to 5 LLMs. Performance varied across categories, highlighting the strengths and limitations of individual models.
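The reported percentages can be checked from the raw counts. The snippet below is a minimal sketch; the function names are hypothetical and not from the paper. It shows that 119/126 gives 94.4% either way, while the reported 92.8% for 117/126 matches truncation to one decimal place (conventional rounding would give 92.9%).

```python
import math

TOTAL_QUESTIONS = 126  # number of questions stated in the abstract


def percent_rounded(correct: int, total: int = TOTAL_QUESTIONS) -> float:
    """Percentage of correct answers, rounded to one decimal place."""
    return round(100 * correct / total, 1)


def percent_truncated(correct: int, total: int = TOTAL_QUESTIONS) -> float:
    """Percentage of correct answers, truncated to one decimal place."""
    return math.floor(1000 * correct / total) / 10


# Counts taken from the abstract.
scores = {
    "OpenEvidence": 119,
    "Microsoft Copilot": 119,
    "ChatGPT-4o": 117,
    "Gemini 2.5 Pro": 117,
}

for name, correct in scores.items():
    print(f"{name}: {correct}/{TOTAL_QUESTIONS} "
          f"rounded={percent_rounded(correct)}% "
          f"truncated={percent_truncated(correct)}%")
```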

CONCLUSION

OpenEvidence and Microsoft Copilot achieved the highest accuracy among evaluated LLMs, highlighting their potential for precisely addressing complex clinical cases. This study emphasizes the need for specialized, validated artificial intelligence tools in medical practice. Although promising, current models face limitations in real-world applications, requiring further refinement to support clinical decision making reliably.
