DiDonna Nicole, Shetty Pragna N, Khan Kamran, Damitz Lynn
From the School of Medicine, University of North Carolina, Chapel Hill, N.C.
Division of Plastic and Reconstructive Surgery, University of North Carolina, Chapel Hill, N.C.
Plast Reconstr Surg Glob Open. 2024 Jun 21;12(6):e5929. doi: 10.1097/GOX.0000000000005929. eCollection 2024 Jun.
Within the last few years, artificial intelligence (AI) chatbots have sparked fascination for their potential as an educational tool. Although it has been documented that one such chatbot, ChatGPT, is capable of performing at a moderate level on plastic surgery examinations and has the capacity to become a beneficial educational tool, the potential of other chatbots remains unexplored.
To investigate the efficacy of AI chatbots in plastic surgery education, performance on the 2019-2023 Plastic Surgery In-service Training Examination (PSITE) was compared among seven popular AI platforms: ChatGPT-3.5, ChatGPT-4.0, Google Bard, Google PaLM, Microsoft Bing AI, Claude, and My AI by Snapchat. Answers were evaluated for accuracy and incorrect responses were characterized by question category and error type.
ChatGPT-4.0 outperformed the other platforms, reaching accuracy rates up to 79%. On the 2023 PSITE, ChatGPT-4.0 ranked in the 95th percentile of first-year residents; however, relative performance worsened when compared with upper-level residents, with the platform ranking in the 12th percentile of sixth-year residents. The performance among other chatbots was comparable, with their average PSITE score (2019-2023) ranging from 48.6% to 57.0%.
Results of our study indicate that ChatGPT-4.0 has potential as an educational tool in the field of plastic surgery; however, given the poor performance of the other chatbots on the PSITE, their use should be approached with caution at this time. To our knowledge, this is the first article comparing the performance of multiple AI chatbots within the realm of plastic surgery education.