

Comparison of ChatGPT-4, Copilot, Bard and Gemini Ultra on an Otolaryngology Question Bank.

Author Information

Ramchandani Rashi, Guo Eddie, Mostowy Michael, Kreutz Jason, Sahlollbey Nick, Carr Michele M, Chung Janet, Caulley Lisa

Affiliations

Faculty of Medicine, University of Ottawa, Ottawa, Ontario, Canada.

Institute of Health Policy, Management and Evaluation, University of Toronto, Toronto, Ontario, Canada.

Publication Information

Clin Otolaryngol. 2025 Jul;50(4):704-711. doi: 10.1111/coa.14302. Epub 2025 Mar 13.

Abstract

OBJECTIVE

To compare the performance of Google Bard, Microsoft Copilot, GPT-4 with vision (GPT-4) and Gemini Ultra on the OTO Chautauqua, a student-created, faculty-reviewed otolaryngology question bank.

STUDY DESIGN

Comparative performance evaluation of different LLMs.

SETTING

N/A.

PARTICIPANTS

N/A.

METHODS

Large language models (LLMs) are being extensively tested in medical education. However, their accuracy and effectiveness remain understudied, particularly in otolaryngology. This study involved inputting 350 single-best-answer multiple-choice questions, including 18 image-based questions, into four LLMs. Questions were sourced from six independent question banks covering (a) rhinology, (b) head and neck oncology, (c) endocrinology, (d) general otolaryngology, (e) paediatrics, (f) otology, (g) facial plastics and reconstruction and (h) trauma. The LLMs were instructed to provide reasoning for their answers, and the length of each response was recorded.
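The evaluation procedure can be illustrated with a minimal sketch. The harness below is a hypothetical reconstruction, not the authors' pipeline: the MCQ data layout, the query_model() stub and the answer-letter heuristic are assumptions, and the stub must be wired to whichever chat model (or web interface) is being evaluated. It records, per model, the proportion of correct single-best answers and the mean response length in characters, the two outcome measures reported in the Results.

```python
# Hypothetical evaluation harness (illustrative only; not the authors' code).
# Assumptions: query_model() must be connected to the model under test, and
# option letters A-E plus character-count response length are stand-ins.
from dataclasses import dataclass
from statistics import mean
from typing import Optional

@dataclass
class MCQ:
    prompt: str                       # question stem plus lettered options
    correct: str                      # single best answer, e.g. "C"
    topic: str                        # e.g. "rhinology", "otology"
    image_path: Optional[str] = None  # set for the 18 image-based items

def query_model(model_name: str, question: MCQ) -> str:
    """Placeholder for a call to a chat model (API or web interface).
    Should return the model's full free-text answer with its reasoning."""
    raise NotImplementedError("connect this to the model being evaluated")

def extract_choice(response: str) -> str:
    """Naive heuristic: take the first standalone option letter in the reply."""
    for token in response.replace(".", " ").replace(")", " ").split():
        if token.upper() in {"A", "B", "C", "D", "E"}:
            return token.upper()
    return ""

def evaluate(model_name: str, questions: list[MCQ]) -> dict:
    """Score one model on the question bank and summarise the two outcomes."""
    n_correct, lengths = 0, []
    for q in questions:
        reply = query_model(model_name, q)
        lengths.append(len(reply))              # response length in characters
        if extract_choice(reply) == q.correct:
            n_correct += 1
    return {
        "model": model_name,
        "accuracy": n_correct / len(questions),
        "mean_response_length": mean(lengths),
    }
```

With outputs from all four models collected this way, the aggregate and subgroup comparisons reported below reduce to straightforward proportions and means.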

RESULTS

Aggregate and subgroup analysis revealed that Gemini (79.8%) outperformed the other LLMs in accuracy, followed by GPT-4 (71.1%), Copilot (68.0%) and Bard (65.1%). The LLMs had significantly different average response lengths: Bard's (x̄ = 1685.24) were the longest, with no difference between GPT-4 (x̄ = 827.34) and Copilot (x̄ = 904.12). Gemini's longer responses (x̄ = 1291.68) included explanatory images and links. Gemini and GPT-4 correctly answered image-based questions (n = 18), unlike Copilot and Bard, highlighting their adaptability and multimodal capabilities.

CONCLUSION

Gemini outperformed the other LLMs in accuracy, followed by GPT-4, Copilot and Bard. GPT-4, although second in accuracy, provided concise and relevant explanations. Despite the promising performance of LLMs, medical learners should cautiously assess their accuracy and decision-making reliability.


Figure: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c1e0/12128011/79b37e5e49fc/COA-50-704-g002.jpg
