Department of Orthopedic Surgery, Seoul National University Hospital, Seoul National University College of Medicine, Seoul, Korea.
Clin Orthop Surg. 2024 Aug;16(4):669-673. doi: 10.4055/cios23179. Epub 2024 Mar 7.
The application of artificial intelligence and large language models in the medical field requires an evaluation of their accuracy in providing medical information. This study aimed to assess the performance of Chat Generative Pre-trained Transformer (ChatGPT) models 3.5 and 4 in solving orthopedic board-style questions.
A total of 160 text-only questions from the Orthopedic Surgery Department at Seoul National University Hospital, conforming to the format of the Korean Orthopedic Association board certification examinations, were input into the ChatGPT 3.5 and ChatGPT 4 programs. The questions were divided into 11 subcategories. The accuracy rates of the initial answers provided by ChatGPT 3.5 and ChatGPT 4 were analyzed. In addition, the inconsistency rates of the answers were evaluated by regenerating the responses.
ChatGPT 3.5 answered 37.5% of the questions correctly, while ChatGPT 4 showed an accuracy rate of 60.0% (p < 0.001). ChatGPT 4 demonstrated superior performance across most subcategories, except for the tumor-related questions. The rates of inconsistency in answers were 47.5% for ChatGPT 3.5 and 9.4% for ChatGPT 4.
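The abstract does not name the statistical test behind the reported p-value. As a rough, illustrative check only, a pooled two-proportion z-test on the reported accuracy rates (60/160 vs. 96/160 correct) reproduces a p-value below 0.001; note this is an assumption for illustration, and the paired structure of the data (both models answered the same 160 questions) would more properly call for a paired test such as McNemar's.

```python
import math

def two_proportion_z(correct_a, n_a, correct_b, n_b):
    """Pooled two-proportion z-statistic for the difference in accuracy."""
    p_a = correct_a / n_a
    p_b = correct_b / n_b
    p_pool = (correct_a + correct_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Reported results: ChatGPT 3.5 answered 60/160 (37.5%) correctly,
# ChatGPT 4 answered 96/160 (60.0%) correctly.
z = two_proportion_z(60, 160, 96, 160)

# Two-sided p-value from the standard normal tail, via the
# complementary error function: p = erfc(z / sqrt(2)).
p_value = math.erfc(z / math.sqrt(2))
```

Under this (assumed) unpaired test, the difference is significant at well below the 0.001 level, consistent with the reported result.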
ChatGPT 4 showed the ability to pass orthopedic board-style examinations, outperforming ChatGPT 3.5 in accuracy rate. However, inconsistencies in response generation and instances of incorrect answers with misleading explanations require caution when applying ChatGPT in clinical settings or for educational purposes.