Systematic Review on Large Language Models in Orthopaedic Surgery.

Author Information

Mo Kevin, Lin Rowen, Dunn Evan, Girgis Gio, Fang William, Walsh John, Banyai-Flores Nicole, Watson Troy, Lee Daniel

Affiliations

Orthopaedic Surgery, Valley Hospital Medical Center, 620 Shadow Ln, Las Vegas, NV 89106, USA.

Touro University Nevada College of Osteopathic Medicine, 874 American Pacific Dr, Henderson, NV 89104, USA.

Publication Information

J Clin Med. 2025 Aug 20;14(16):5876. doi: 10.3390/jcm14165876.

Abstract

Background: Since ChatGPT was released in 2022, many large language models (LLMs) have been developed, showing potential to expand the field of orthopaedic surgery. This is the first systematic review examining the current state of research on LLMs in orthopaedic surgery. The aim of this study is to identify which LLMs have been researched, assess their functionalities, and evaluate the quality of their results. Methods: The systematic review was conducted using the PubMed, Embase, and Cochrane Library databases in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines. Results: A total of 60 studies were included in the final review, all of which included ChatGPT version 3.0 or 4.0. Five studies included Bard, and one article each included Perplexity AI and Bing. Most studies assessed performance on orthopaedic assessment questions (23 studies) or the ability to correctly answer open-ended questions (31 studies). The outcome measures used to assess the accuracy of LLMs in most of the included studies were the percentage of correct answers on multiple-choice questions or expert-graded consensus on open-ended responses. The accuracy of ChatGPT 4.0 on orthopaedic assessment questions ranged from 47.2% to 73.6% without images and 35.7% to 65.85% with images. The accuracy of ChatGPT 3.5 was 29.4-55.8% without images and 22.4-46.34% with images. The accuracy of Bard ranged from 49.8% to 58%. Orthopaedic residents consistently scored better than LLMs, in the range of 74.2-75.3%. Conclusions: ChatGPT 4 showed significant improvement over ChatGPT 3.5 in answering orthopaedic assessment questions. When comparing the performance of orthopaedic residents to LLMs, orthopaedic residents scored higher overall. There remains significant opportunity to improve LLM performance on orthopaedic assessments as well as on image-based analysis and clinical documentation.

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7dbf/12386971/869217985bc0/jcm-14-05876-g001.jpg
