Borgonovo Fabio, Matsuo Takahiro, Petri Francesco, Amin Alavi Seyed Mohammad, Mazudie Ndjonko Laura Chelsea, Gori Andrea, Berbari Elie F
Division of Public Health, Infectious Diseases and Occupational Medicine, Department of Medicine, Mayo Clinic College of Medicine and Science, Mayo Clinic, Rochester, MN.
Department of Infectious Diseases, "Luigi Sacco" University Hospital, Milan, Italy.
Mayo Clin Proc Digit Health. 2025 May 23;3(3):100230. doi: 10.1016/j.mcpdig.2025.100230. eCollection 2025 Sep.
To evaluate the ability of 15 large language models (LLMs) to solve clinical cases of osteoarticular infections in accordance with published guidelines.
The study evaluated 15 LLMs across 5 categories of osteoarticular infections: periprosthetic joint infection, diabetic foot infection, native vertebral osteomyelitis, fracture-related infection, and septic arthritis. Models were selected systematically, including general-purpose and medical-specific systems, ensuring robust English-language support. In total, 126 text-based questions, developed by the authors from published guidelines and validated by experts, assessed diagnostic, management, and treatment strategies. Each model answered every question individually, with responses classified as correct or incorrect against the guidelines. All tests were conducted between April 17, 2025, and April 28, 2025. Results, presented as percentages of correct answers and aggregated scores, highlight performance trends. Mixed-effects logistic regression with a random question effect was used to compare the LLMs' performance on the study questions.
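The descriptive part of this analysis (per-model counts and percentages of correct answers) can be sketched as follows; the grading rows below are hypothetical placeholders, not the study's actual data, and the function name `accuracy_by_model` is our own for illustration:

```python
from collections import defaultdict

# Hypothetical grading sheet: (model, question_id, correct).
# In the study, each of 15 models answered the same 126 questions.
answers = [
    ("OpenEvidence", 1, True), ("OpenEvidence", 2, True), ("OpenEvidence", 3, False),
    ("ChatGPT-4o",   1, True), ("ChatGPT-4o",   2, False), ("ChatGPT-4o",  3, True),
]

def accuracy_by_model(rows):
    """Return {model: (n_correct, n_total, percent_correct)}."""
    tally = defaultdict(lambda: [0, 0])  # model -> [correct, total]
    for model, _question, correct in rows:
        tally[model][0] += int(correct)
        tally[model][1] += 1
    return {m: (c, n, round(100 * c / n, 1)) for m, (c, n) in tally.items()}

print(accuracy_by_model(answers))
```

The inferential step (mixed-effects logistic regression with a random intercept per question) would then be fit on the same long-format rows, treating each question as a random effect so that repeated testing of all models on identical items is accounted for.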
The performance of 15 LLMs was evaluated, with the percentage of correct answers reported. OpenEvidence and Microsoft Copilot achieved the highest score (119/126 [94.4%]), excelling in multiple categories. ChatGPT-4o and Gemini 2.5 Pro each scored 117 of 126 (92.9%). When used as the reference, OpenEvidence was not inferior to any comparator and was superior to 5 LLMs. Performance varied across categories, highlighting the strengths and limitations of individual models.
OpenEvidence and Microsoft Copilot achieved the highest accuracy among the evaluated LLMs, highlighting their potential for precisely addressing complex clinical cases. This study emphasizes the need for specialized, validated artificial intelligence tools in medical practice. Although promising, current models face limitations in real-world applications and require further refinement to support clinical decision making reliably.