Flynn Jason C, Zeitlin Jacob, Arango Sebastian D, Pineda Nathaniel, Miller Andrew J, Weir Tristan B
Department of Orthopaedic Surgery, Philadelphia Hand to Shoulder Center, Philadelphia, USA.
Department of Orthopaedics, Philadelphia Hand to Shoulder Center, Philadelphia, USA.
Cureus. 2024 Sep 25;16(9):e70205. doi: 10.7759/cureus.70205. eCollection 2024 Sep.
Multimodal large language models (MLLMs), such as OpenAI's ChatGPT (San Francisco, CA), have the potential to improve medical care delivery and education, although important shortcomings in accuracy and image interpretation have been noted. The aim of this study was to assess the multimodal performance of a ChatGPT model customized with hand surgery-specific knowledge.
A customized generative pre-trained transformer (GPT) was trained using peer-reviewed literature recommended by the American Society for Surgery of the Hand (ASSH). Questions were taken from the ASSH 2022 Self-Assessment Examination (SAE). GPT-4 and the customized GPT were asked text-based multiple-choice questions. The customized GPT was also asked image-containing questions, both with and without access to the image(s) associated with each question.
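For readers who want to replicate the multimodal querying step, the sketch below shows one way to pose an image-containing multiple-choice question to an OpenAI multimodal model through the chat completions API, with the image entry omitted for the no-image condition. This is not the authors' code: the study used a customized GPT in the ChatGPT interface, and the model name, prompt wording, and image URL here are illustrative assumptions.

```python
# Hypothetical sketch of the with-image / without-image querying protocol.
# Model name, question text, and image URL are placeholders, not study data.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

question = (
    "A patient presents with the radiograph shown. "
    "Which is the most appropriate treatment? "
    "A) ... B) ... C) ... D) ..."  # stem and options are placeholders
)

response = client.chat.completions.create(
    model="gpt-4o",  # assumed multimodal model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            # omit this entry to ask the same question without the image
            {"type": "image_url",
             "image_url": {"url": "https://example.com/radiograph.png"}},
        ],
    }],
)
print(response.choices[0].message.content)
```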
A total of 192 questions were included. The customized GPT responded to the 119 text-only questions with greater accuracy than GPT-4 (107 [89.9%] versus 91 [76.5%]; P = 0.001). Human examinees answered 87.3% (IQR: 71.6-93.7%) of the same text-based questions correctly. Of the 73 questions with images, the customized GPT answered 55 (75.3%) correctly, which dropped to 51 (69.9%) when the images were withheld (P = 0.317). Human examinees answered 87.2% (IQR: 79.4-95.4%) of the image-based questions correctly.
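The two headline comparisons can be checked from the counts reported above. The sketch below runs an unpaired chi-squared test on each 2x2 table of correct/incorrect responses; the abstract does not name the exact test the authors used, so the p-values produced here may differ from the reported P = 0.001 and P = 0.317.

```python
# Sanity-check of the reported accuracy comparisons, assuming an
# unpaired chi-squared test (the abstract does not specify the test).
from scipy.stats import chi2_contingency

# rows: customized GPT, GPT-4; columns: correct, incorrect (119 text-only questions)
text_only = [[107, 119 - 107],
             [91, 119 - 91]]
chi2, p, dof, expected = chi2_contingency(text_only)
print(f"text-only: chi2={chi2:.2f}, p={p:.4f}")

# customized GPT on the 73 image questions: with images vs. images withheld
image_vs_no_image = [[55, 73 - 55],
                     [51, 73 - 51]]
chi2, p, dof, expected = chi2_contingency(image_vs_no_image)
print(f"image vs. no image: chi2={chi2:.2f}, p={p:.4f}")
```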
Our findings suggest that hand surgery-specific training significantly improves ChatGPT's ability to answer text-based hand surgery questions. ChatGPT remains limited in its ability to interpret images when answering questions about hand conditions. These data show that hand surgeons can create customized GPT models that provide tailored answers to specific questions, which may serve as the foundation for educational and clinical tools.