Lum, Zachary C.
Nova Southeastern University, Davie, FL, USA.
Clin Orthop Relat Res. 2023 Aug 1;481(8):1623-1630. doi: 10.1097/CORR.0000000000002704. Epub 2023 May 23.
BACKGROUND: Neural networks, deep learning, and artificial intelligence (AI) have advanced rapidly in recent years. Earlier deep-learning AI models were domain-specific, trained on narrowly defined datasets of interest to yield high accuracy and precision. A new general-domain AI model built on a large language model (LLM), ChatGPT (OpenAI), has gained attention. Although AI has demonstrated proficiency in managing vast amounts of data, implementation of that knowledge remains a challenge.
QUESTIONS/PURPOSES: (1) What percentage of Orthopaedic In-Training Examination questions can a generative, pretrained transformer chatbot (ChatGPT) answer correctly? (2) How does that percentage compare with the results of orthopaedic residents at different training levels, and, given that scoring below the 10th percentile relative to 5th-year residents is likely to correspond to a failing American Board of Orthopaedic Surgery score, is this LLM likely to pass the orthopaedic surgery written boards? (3) Does increasing question taxonomy level affect the LLM's ability to select the correct answer?
METHODS: This study randomly selected 400 of 3840 publicly available questions based on the Orthopaedic In-Training Examination and compared the LLM's mean score with those of residents who took the examination over a 5-year period. Questions with figures, diagrams, or charts were excluded, as were five questions the LLM could not answer, leaving 207 administered questions for which the raw score was recorded. The LLM's results were compared with the Orthopaedic In-Training Examination rankings of orthopaedic surgery residents. Based on the findings of an earlier study, a pass-fail cutoff was set at the 10th percentile. Answered questions were then categorized using the Buckwalter taxonomy of recall, which spans increasingly complex levels of interpretation and application of knowledge; the LLM's performance across taxonomic levels was compared using a chi-square test.
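For illustration only, the following is a minimal sketch of that sampling-and-exclusion step, assuming a hypothetical question bank with an invented has_image flag; neither the data structure nor the tooling used in the study is described in this abstract.

```python
import random

random.seed(42)  # fixed seed so this illustrative draw is reproducible

# Hypothetical bank of 3840 publicly available OITE-style items.
# The fields here are invented for illustration; the study's actual
# data structure is not published.
question_bank = [
    {"id": i, "has_image": random.random() < 0.4} for i in range(3840)
]

# Randomly select 400 questions, as described in the study.
sampled = random.sample(question_bank, 400)

# Exclude items with figures, diagrams, or charts; in the study, five
# additional items the LLM could not answer were also dropped, leaving
# 207 administered questions.
administered = [q for q in sampled if not q["has_image"]]
print(f"{len(administered)} questions administered")
```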
RESULTS: ChatGPT selected the correct answer 47% of the time (97 of 207) and answered incorrectly 53% of the time (110 of 207). Based on prior Orthopaedic In-Training Examination testing, the LLM scored in the 40th percentile for postgraduate year (PGY) 1 residents, the eighth percentile for PGY2 residents, and the first percentile for PGY3, PGY4, and PGY5 residents; based on the latter finding (and using a predefined cutoff of the 10th percentile of PGY5 residents as the threshold for a passing score), it seems unlikely that the LLM would pass the written board examination. The LLM's performance decreased as question taxonomy level increased (it answered 54% [54 of 101] of Tax 1 questions correctly, 51% [18 of 35] of Tax 2 questions correctly, and 34% [24 of 71] of Tax 3 questions correctly; p = 0.034).
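As a sanity check on the taxonomy comparison, here is a minimal sketch of a chi-square test applied to the correct/incorrect counts reported above; scipy is an assumed tool, since the authors' statistical software is not stated in this abstract.

```python
from scipy.stats import chi2_contingency

# Correct vs. incorrect counts per Buckwalter taxonomy level, as reported:
# Tax 1: 54 of 101, Tax 2: 18 of 35, Tax 3: 24 of 71.
observed = [
    [54, 101 - 54],  # Tax 1
    [18, 35 - 18],   # Tax 2
    [24, 71 - 24],   # Tax 3
]

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, df = {dof}, p = {p:.3f}")
# Yields p of roughly 0.03, consistent with the reported p = 0.034
# (small differences are expected, since the counts are back-calculated
# from rounded percentages).
```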
CONCLUSION: Although this general-domain LLM has a low likelihood of passing the orthopaedic surgery board examination, its testing performance and knowledge are comparable to those of a first-year orthopaedic surgery resident. The LLM's ability to provide accurate answers declines with increasing question taxonomy level and complexity, indicating a deficiency in implementing knowledge.
CLINICAL RELEVANCE: Current AI appears to perform better at knowledge- and interpretation-based inquiries. Based on this study, and given other areas of opportunity, it may become an additional tool for orthopaedic learning and education.