
Evaluating the Performance of ChatGPT, Gemini, and Bing Compared with Resident Surgeons in the Otorhinolaryngology In-service Training Examination.

Author Information

Mete Utku

Affiliation

Bursa Uludağ University Faculty of Medicine, Department of Otorhinolaryngology, Bursa, Türkiye.

Publication Information

Turk Arch Otorhinolaryngol. 2024 Oct 23;62(2):48-57. doi: 10.4274/tao.2024.3.5.

Abstract

OBJECTIVE

Large language models (LLMs) are used in a wide range of fields for their ability to produce human-like text. They are particularly useful in medical education, where they can support residents' clinical management skills and exam preparation. This study aimed to evaluate and compare the performance of ChatGPT (GPT-4), Gemini, and Bing against one another and against otorhinolaryngology residents in answering in-service training examination questions, and to provide insights into the usefulness of these models in medical education and healthcare.

METHODS

Eight otorhinolaryngology in-service training examinations were used for the comparison. A total of 316 questions were prepared from the Resident Training Textbook of the Turkish Society of Otorhinolaryngology Head and Neck Surgery and presented to the three artificial intelligence models. The exam results were then evaluated to determine the accuracy of both the models and the residents.
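The grading step described above (compare each model's chosen option with the answer key and compute the share of correct answers) can be sketched in a few lines of Python. This is a minimal illustration, not the study's actual pipeline; the function name and the toy answer lists are hypothetical.

```python
# Minimal accuracy-scoring sketch (hypothetical; the study's grading
# script is not published in the abstract).

def accuracy(model_answers: list[str], answer_key: list[str]) -> float:
    """Fraction of multiple-choice questions answered correctly."""
    if len(model_answers) != len(answer_key):
        raise ValueError("answer list and key must have the same length")
    correct = sum(a == k for a, k in zip(model_answers, answer_key))
    return correct / len(answer_key)

# Toy usage with letter options, as in a multiple-choice in-service exam.
key = ["A", "C", "B", "E", "D"]
gpt4_answers = ["A", "C", "D", "E", "D"]
print(f"accuracy: {accuracy(gpt4_answers, key):.2%}")  # accuracy: 80.00%
```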

RESULTS

GPT-4 achieved the highest accuracy among the LLMs at 54.75% (GPT-4 vs. Gemini, p=0.002; GPT-4 vs. Bing, p<0.001), followed by Gemini at 40.50% and Bing at 37.00% (Gemini vs. Bing, p=0.327). However, senior residents outperformed all LLMs and the other resident groups, with an accuracy of 75.5% (p<0.001). The LLMs were competitive only with junior residents: GPT-4 and Gemini performed similarly to juniors, whose accuracy was 46.90% (p=0.058 and p=0.120, respectively), while juniors still outperformed Bing (p=0.019).
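The abstract reports pairwise p-values (e.g., GPT-4 vs. Gemini, p=0.002) without naming the statistical test. A chi-square test on a 2x2 table of correct/incorrect counts is one standard way to compare two accuracy rates; the sketch below assumes that test and uses counts reconstructed from the reported percentages (173/316 ≈ 54.75%, 128/316 ≈ 40.50%), so both the method and the numbers are assumptions rather than the study's actual analysis.

```python
# Pairwise comparison of two accuracy rates via a chi-square test
# (assumed method; the abstract does not state which test was used).
from scipy.stats import chi2_contingency

n = 316               # questions, per the abstract
gpt4_correct = 173    # ~54.75% of 316, reconstructed and rounded
gemini_correct = 128  # ~40.50% of 316, reconstructed and rounded

table = [
    [gpt4_correct, n - gpt4_correct],      # GPT-4: correct, incorrect
    [gemini_correct, n - gemini_correct],  # Gemini: correct, incorrect
]
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2={chi2:.2f}, dof={dof}, p={p:.4g}")
```

Since every model answered the same question set, a paired test such as McNemar's would also be a reasonable choice; the abstract does not specify which was used.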

CONCLUSION

The LLMs currently fall short of the medical accuracy achieved by senior and mid-level residents. However, their stronger performance in specific subspecialties points to potential usefulness in certain medical fields.

Figure 1. https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3e64/11572338/f9a4cb87890c/TurkArchOtorhinolaryngol-62-48-figure-1.jpg
