Erdat Efe Cem, Kavak Engin Eren
Department of Medical Oncology, Ankara University Cebeci Hospital, Mamak, Ankara, Turkey.
Department of Medical Oncology, Ankara Etlik City Training and Research Hospital, Yenimahalle, Ankara, Turkey.
BMC Cancer. 2025 Feb 4;25(1):197. doi: 10.1186/s12885-025-13596-0.
BACKGROUND: Large language models (LLMs) have shown promise in various medical applications, including clinical decision-making and education. In oncology, the increasing complexity of patient care and the vast volume of medical literature require efficient tools to assist practitioners. However, the use of LLMs in oncology education and knowledge assessment remains underexplored. This study aims to evaluate and compare the oncological knowledge of four LLMs using standardized board examination questions.

METHODS: We assessed the performance of four LLMs: Claude 3.5 Sonnet (Anthropic), ChatGPT 4o (OpenAI), Llama-3 (Meta), and Gemini 1.5 (Google), using the Turkish Society of Medical Oncology's annual board examination questions from 2016 to 2024. A total of 790 valid multiple-choice questions covering various oncology topics were included. Each model was tested on its ability to answer these questions in Turkish. Performance was analyzed based on the number of correct answers, with statistical comparisons made using chi-square tests and one-way ANOVA.

RESULTS: Claude 3.5 Sonnet outperformed the other models, passing all eight exams with an average score of 77.6%. ChatGPT 4o passed seven of the eight exams, with an average score of 67.8%. Llama-3 and Gemini 1.5 showed lower performance, passing four and three exams, respectively, with average scores below 50%. Significant differences were observed among the models' performances (F = 17.39, p < 0.001). Claude 3.5 Sonnet and ChatGPT 4o demonstrated higher accuracy across most oncology topics. A decline in performance in recent years, particularly on the 2024 exam, suggests limitations due to outdated training data.

CONCLUSIONS: Significant differences in oncological knowledge were observed among the four LLMs, with Claude 3.5 Sonnet and ChatGPT 4o demonstrating superior performance. These findings suggest that advanced LLMs have the potential to serve as valuable tools in oncology education and decision support. However, regular updates and enhancements are necessary to maintain their relevance and accuracy, especially to incorporate the latest medical advancements.
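The one-way ANOVA used to compare model performances can be sketched in pure Python. The per-exam scores below are illustrative placeholders chosen only to roughly match the reported averages; they are NOT the study's actual data, and the resulting F statistic will not reproduce the reported F = 17.39.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Compute the one-way ANOVA F statistic for a list of score groups."""
    k = len(groups)                               # number of models
    n = sum(len(g) for g in groups)               # total observations
    grand_mean = sum(sum(g) for g in groups) / n
    # Between-group and within-group sums of squares
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum(sum((x - mean(g)) ** 2 for x in g) for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical per-exam percentage scores for the four models
# (eight exams each); placeholder values, not the study's data.
scores = {
    "Claude 3.5 Sonnet": [80, 75, 78, 79, 76, 77, 81, 75],
    "ChatGPT 4o":        [70, 66, 68, 69, 67, 65, 71, 66],
    "Llama-3":           [48, 50, 45, 47, 49, 44, 46, 48],
    "Gemini 1.5":        [42, 45, 40, 44, 43, 41, 46, 42],
}
f_stat = one_way_anova_f(list(scores.values()))
print(f"F = {f_stat:.2f}")
```

With group means this well separated relative to the within-group spread, the F statistic is large and the null hypothesis of equal mean scores would be rejected, mirroring the direction of the study's reported result.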