
A Comparison Between GPT-3.5, GPT-4, and GPT-4V: Can the Large Language Model (ChatGPT) Pass the Japanese Board of Orthopaedic Surgery Examination?

Author Information

Nakajima Nozomu, Fujimori Takahito, Furuya Masayuki, Kanie Yuya, Imai Hirotatsu, Kita Kosuke, Uemura Keisuke, Okada Seiji

Affiliations

Orthopaedics, Sakai City Medical Center, Sakai, JPN.

Orthopaedic Surgery, Osaka University, Graduate School of Medicine, Suita, JPN.

Publication Information

Cureus. 2024 Mar 18;16(3):e56402. doi: 10.7759/cureus.56402. eCollection 2024 Mar.

Abstract

Introduction: Large language models such as ChatGPT (OpenAI, San Francisco, CA) have evolved rapidly. These models are designed to reason and respond like humans and possess a broad range of specialized knowledge. GPT-3.5 was reported to perform at a passing level on the United States Medical Licensing Examination. Its capabilities continue to evolve, and in October 2023, GPT-4V became available as a model capable of image recognition. Because these models may soon be incorporated into medical practice, it is important to know their current performance. We aimed to evaluate the performance of ChatGPT in the field of orthopedic surgery.

Methods: We used three years of the Japanese Board of Orthopaedic Surgery Examination (JBOSE), conducted in 2021, 2022, and 2023. The questions, their multiple-choice answers, and the official examination rubric were used in their original Japanese form. We input these questions into three versions of ChatGPT: GPT-3.5, GPT-4, and GPT-4V. For image-based questions, we input only the text of the question for GPT-3.5 and GPT-4, and both the image and the text for GPT-4V. Because the minimum score required to pass is not officially disclosed, it was estimated from publicly available data.

Results: The estimated minimum score required to pass was 50.1% (43.7-53.8%). GPT-4 answered 59% (55-61%) of all questions correctly, including the image-based ones, and reached the passing line; when image-based questions were excluded, its score rose to 67% (63-73%). GPT-3.5 answered only 30% (28-32%) correctly and could not pass the examination. The difference in performance between GPT-4 and GPT-3.5 was significant (p < 0.001). On image-based questions, the proportion of correct answers was 25% for GPT-3.5, 38% for GPT-4, and 38% for GPT-4V, with no significant difference between GPT-4 and GPT-4V.

Conclusions: ChatGPT performed well enough to pass the orthopedic specialist examination. With further training data such as images, ChatGPT is expected to find applications in the orthopedics field.
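The abstract does not state which statistical test produced the p < 0.001 comparison between GPT-4 and GPT-3.5. As an illustration only, the sketch below applies a standard pooled two-proportion z-test to hypothetical question counts (300 items per model, chosen to match the reported ~59% vs. ~30% accuracy; the actual JBOSE item counts are not given here), showing how such a difference in proportions can be assessed:

```python
import math

def two_proportion_z(k1: int, n1: int, k2: int, n2: int):
    """Pooled two-proportion z-test.

    k1/n1 and k2/n2 are correct answers out of total questions for the
    two models. Returns (z, two-sided p-value), with the p-value taken
    from the normal approximation: P(|Z| > z) = erfc(|z| / sqrt(2)).
    """
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return z, p_two_sided

# Hypothetical counts (NOT from the paper): ~59% vs ~30% of 300 items.
z, p = two_proportion_z(177, 300, 90, 300)
print(f"z = {z:.2f}, p = {p:.2e}")
```

With proportions this far apart at a few hundred items each, the test easily clears the p < 0.001 threshold, which is consistent with the significance level the abstract reports.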


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/bf65/11023708/9c505ebccfe1/cureus-0016-00000056402-i01.jpg
