Hubany Shannon S, Scala Fernanda D, Hashemi Kiana, Kapoor Saumya, Fedorova Julia R, Vaccaro Matthew J, Ridout Rees P, Hedman Casey C, Kellogg Brian C, Leto Barone Angelo A
From the University of Central Florida College of Medicine, Orlando, Fla.
Division of Craniofacial and Pediatric Plastic Surgery, Nemours Children's Hospital, Orlando, Fla.
Plast Reconstr Surg Glob Open. 2024 Sep 5;12(9):e6136. doi: 10.1097/GOX.0000000000006136. eCollection 2024 Sep.
ChatGPT, launched in 2022 and updated to Generative Pre-trained Transformer 4 (GPT-4) in 2023, is a large language model trained on extensive data, including medical information. This study compares ChatGPT's performance on Plastic Surgery In-Service Examinations with medical residents nationally as well as its earlier version, ChatGPT-3.5.
This study reviewed 1500 questions from the Plastic Surgery In-Service Examinations from 2018 to 2023. After excluding image-based, unscored, and inconclusive questions, 1292 were analyzed. The question stem and each multiple-choice answer were input verbatim into ChatGPT-4.
ChatGPT-4 correctly answered 961 (74.4%) of the included questions. Performance by section was best in core surgical principles (79.1% correct) and lowest in craniomaxillofacial (69.1%). ChatGPT-4 ranked between the 61st and 97th percentiles compared with all residents. ChatGPT-4 also significantly outperformed ChatGPT-3.5 on the 2018-2022 examinations (P < 0.001): ChatGPT-3.5 averaged 55.5% correctness, whereas ChatGPT-4 averaged 74%, a mean difference of 18.54%. In 2021, ChatGPT-3.5 ranked in the 23rd percentile of all residents, whereas ChatGPT-4 ranked in the 97th percentile. ChatGPT-4 outperformed 80.7% of residents on average and scored above the 97th percentile among first-year residents. Its performance was comparable with that of sixth-year integrated residents, ranking in the 55.7th percentile on average. These results show significant improvements in ChatGPT-4's application of medical knowledge within six months of ChatGPT-3.5's release.
This study reveals ChatGPT-4's rapid developments, advancing from a first-year medical resident's level to surpassing independent residents and matching a sixth-year resident's proficiency.