Benchmarking LLM chatbots' oncological knowledge with the Turkish Society of Medical Oncology's annual board examination questions.

Author Information

Erdat Efe Cem, Kavak Engin Eren

Affiliations

Department of Medical Oncology, Ankara University Cebeci Hospital, Mamak, Ankara, Turkey.

Department of Medical Oncology, Ankara Etlik City Training and Research Hospital, Yenimahalle, Ankara, Turkey.

Publication Information

BMC Cancer. 2025 Feb 4;25(1):197. doi: 10.1186/s12885-025-13596-0.

DOI: 10.1186/s12885-025-13596-0
PMID: 39905358
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11792186/
Abstract

BACKGROUND: Large language models (LLMs) have shown promise in various medical applications, including clinical decision-making and education. In oncology, the increasing complexity of patient care and the vast volume of medical literature require efficient tools to assist practitioners. However, the use of LLMs in oncology education and knowledge assessment remains underexplored. This study aims to evaluate and compare the oncological knowledge of four LLMs using standardized board examination questions.

METHODS: We assessed the performance of four LLMs - Claude 3.5 Sonnet (Anthropic), ChatGPT 4o (OpenAI), Llama-3 (Meta), and Gemini 1.5 (Google) - using the Turkish Society of Medical Oncology's annual board examination questions from 2016 to 2024. A total of 790 valid multiple-choice questions covering various oncology topics were included. Each model was tested on its ability to answer these questions in Turkish. Performance was analyzed based on the number of correct answers, with statistical comparisons made using chi-square tests and one-way ANOVA.

RESULTS: Claude 3.5 Sonnet outperformed the other models, passing all eight exams with an average score of 77.6%. ChatGPT 4o passed seven out of eight exams, with an average score of 67.8%. Llama-3 and Gemini 1.5 showed lower performance, passing four and three exams respectively, with average scores below 50%. Significant differences were observed among the models' performances (F = 17.39, p < 0.001). Claude 3.5 Sonnet and ChatGPT 4o demonstrated higher accuracy across most oncology topics. A decline in performance in recent years, particularly on the 2024 exam, suggests limitations due to outdated training data.

CONCLUSIONS: Significant differences in oncological knowledge were observed among the four LLMs, with Claude 3.5 Sonnet and ChatGPT 4o demonstrating superior performance. These findings suggest that advanced LLMs have the potential to serve as valuable tools in oncology education and decision support. However, regular updates and enhancements are necessary to maintain their relevance and accuracy, especially to incorporate the latest medical advancements.
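
The analysis described in the METHODS (score each model on 790 multiple-choice questions, then compare models with chi-square tests and one-way ANOVA) can be illustrated with a short script. The sketch below, in Python with SciPy, is a minimal example under stated assumptions: the per-exam accuracy values are hypothetical placeholders chosen only to echo the reported averages (77.6%, 67.8%, and below 50%), not the study's actual data; only the question total of 790 and the reported test statistic come from the abstract.

    # Minimal sketch of the statistical comparison described in the METHODS.
    # The per-exam accuracies below are hypothetical placeholders, not the
    # study's data.
    from scipy import stats

    # Hypothetical per-exam accuracy (%) for each model across eight exams.
    scores = {
        "Claude 3.5 Sonnet": [80, 78, 76, 79, 77, 75, 78, 78],
        "ChatGPT 4o":        [70, 68, 66, 69, 67, 65, 68, 69],
        "Llama-3":           [52, 48, 45, 50, 47, 44, 49, 46],
        "Gemini 1.5":        [50, 46, 43, 48, 45, 42, 47, 44],
    }

    # One-way ANOVA across the four models' per-exam scores; the paper
    # reports F = 17.39, p < 0.001 for this comparison.
    f_stat, p_value = stats.f_oneway(*scores.values())
    print(f"ANOVA: F = {f_stat:.2f}, p = {p_value:.2g}")

    # Chi-square test on pooled correct/incorrect counts out of the 790
    # valid questions, one row per model.
    n_questions = 790
    table = []
    for model, per_exam in scores.items():
        correct = round(n_questions * sum(per_exam) / len(per_exam) / 100)
        table.append([correct, n_questions - correct])
    chi2, p, dof, _ = stats.chi2_contingency(table)
    print(f"Chi-square: chi2 = {chi2:.2f}, dof = {dof}, p = {p:.2g}")

With the study's actual per-exam scores substituted for the placeholders, the same two SciPy calls should reproduce the reported statistics.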

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/11a8/11792186/344dd970f3f9/12885_2025_13596_Fig1_HTML.jpg

Similar Articles

[1] Benchmarking LLM chatbots' oncological knowledge with the Turkish Society of Medical Oncology's annual board examination questions. BMC Cancer. 2025 Feb 4.
[2] Large Language Models in Biochemistry Education: Comparative Evaluation of Performance. JMIR Med Educ. 2025 Apr 10.
[3] Evaluating text and visual diagnostic capabilities of large language models on questions related to the Breast Imaging Reporting and Data System Atlas 5 edition. Diagn Interv Radiol. 2025 Mar 3.
[4] Claude, ChatGPT, Copilot, and Gemini performance versus students in different topics of neuroscience. Adv Physiol Educ. 2025 Jun 1.
[5] Evaluating Large Language Models in Dental Anesthesiology: A Comparative Analysis of ChatGPT-4, Claude 3 Opus, and Gemini 1.0 on the Japanese Dental Society of Anesthesiology Board Certification Exam. Cureus. 2024 Sep 27.
[6] Accuracy and quality of ChatGPT-4o and Google Gemini performance on image-based neurosurgery board questions. Neurosurg Rev. 2025 Mar 25.
[7] Evaluating Artificial Intelligence Chatbots in Oral and Maxillofacial Surgery Board Exams: Performance and Potential. J Oral Maxillofac Surg. 2025 Mar.
[8] The role of artificial intelligence in medical education: an evaluation of Large Language Models (LLMs) on the Turkish Medical Specialty Training Entrance Exam. BMC Med Educ. 2025 Apr 25.
[9] Accuracy of latest large language models in answering multiple choice questions in dentistry: A comparative study. PLoS One. 2025 Jan 29.
[10] Benchmarking Vision Capabilities of Large Language Models in Surgical Examination Questions. J Surg Educ. 2025 Apr.

Cited By

[1] Artificial intelligence in maxillofacial trauma: expert ally or unreliable assistant? Med Oral Patol Oral Cir Bucal. 2025 Sep 1.
[2] Assessing the accuracy of the GPT-4 model in multidisciplinary tumor board decision prediction. Clin Transl Oncol. 2025 Mar 25.

References

[1] Performance of ChatGPT Across Different Versions in Medical Licensing Examinations Worldwide: Systematic Review and Meta-Analysis. J Med Internet Res. 2024 Jul 25.
[2] Assessing Generative Pretrained Transformers (GPT) in Clinical Decision-Making: Comparative Analysis of GPT-3.5 and GPT-4. J Med Internet Res. 2024 Jun 27.
[3] Accuracy and usability of artificial intelligence chatbot generated chemotherapy protocols. Future Oncol. 2024 Apr 22.
[4] Stratified Evaluation of GPT's Question Answering in Surgery Reveals Artificial Intelligence (AI) Knowledge Gaps. Cureus. 2023 Nov 14.
[5] Performance of ChatGPT, Bard, Claude, and Bing on the Peruvian National Licensing Medical Examination: a cross-sectional study. J Educ Eval Health Prof. 2023.
[6] Are medical oncologists ready for the artificial intelligence revolution? Evaluation of the opinions, knowledge, and experiences of medical oncologists about artificial intelligence technologies. Med Oncol. 2023 Oct 9.
[7] Artificial Intelligence in Medical Education: Comparative Analysis of ChatGPT, Bing, and Medical Students in Germany. JMIR Med Educ. 2023 Sep 4.
[8] Using ChatGPT to write patient clinic letters. Lancet Digit Health. 2023 Apr.
[9] Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLOS Digit Health. 2023 Feb 9.
[10] How Does ChatGPT Perform on the United States Medical Licensing Examination (USMLE)? The Implications of Large Language Models for Medical Education and Knowledge Assessment. JMIR Med Educ. 2023 Feb 8.
