Large language models (LLMs) in radiology exams for medical students: Performance and consequences.

Author information

Gotta Jennifer, Le Hong Quang Anh, Koch Vitali, Gruenewald Leon D, Geyer Tobias, Martin Simon S, Scholtz Jan-Erik, Booz Christian, Pinto Dos Santos Daniel, Mahmoudi Scherwin, Eichler Katrin, Gruber-Rouh Tatjana, Hammerstingl Renate, Biciusca Teodora, Juergens Lisa Joy, Höhne Elena, Mader Christoph, Vogl Thomas J, Reschke Philipp

Affiliations

Department of Diagnostic and Interventional Radiology, Goethe University Frankfurt, Frankfurt am Main, Germany.

Institute of Diagnostic and Interventional Radiology, Pediatric Radiology and Neuroradiology, Rostock University Medical Center, Rostock, Germany.

Publication information

Rofo. 2024 Nov 4. doi: 10.1055/a-2437-2067.

Abstract

The evolving field of medical education is being shaped by technological advancements, including the integration of Large Language Models (LLMs) such as ChatGPT. These models could be invaluable resources for medical students by simplifying complex concepts, enhancing interactive learning, and providing personalized support. LLMs have shown impressive performance in professional examinations, even without specific domain training, making them particularly relevant in the medical field. This study aims to assess the performance of LLMs in radiology examinations for medical students, thereby shedding light on their current capabilities and implications.

This study was conducted using 151 multiple-choice questions that had been used in radiology exams for medical students. The questions were categorized by type and topic and were then processed with OpenAI's GPT-3.5 and GPT-4 via their API, or entered manually into Perplexity AI with GPT-3.5 and Bing. LLM performance was evaluated overall, by question type, and by topic.

GPT-3.5 achieved an overall accuracy of 67.6% on all 151 questions, while GPT-4 significantly outperformed it with an overall accuracy of 88.1% (p<0.001). GPT-4 demonstrated superior performance on both lower-order and higher-order questions compared to GPT-3.5, Perplexity AI, and medical students, excelling particularly on higher-order questions. All GPT models would have passed the radiology exam for medical students at our university.

In conclusion, our study highlights the potential of LLMs as accessible knowledge resources for medical students. GPT-4 performed well on lower-order as well as higher-order questions, making ChatGPT-4 a potentially very useful tool for reviewing radiology exam questions. Radiologists should be aware of ChatGPT's limitations, including its tendency to confidently provide incorrect responses.

· ChatGPT demonstrated remarkable performance, achieving a passing grade on a radiology examination for medical students that did not include image questions.
· GPT-4 exhibited significantly improved performance compared with its predecessor GPT-3.5 and with Perplexity AI, answering 88% of questions correctly.
· Radiologists as well as medical students should be aware of ChatGPT's limitations, including its tendency to confidently provide incorrect responses.
· Gotta J, Le Hong QA, Koch V et al. Large language models (LLMs) in radiology exams for medical students: Performance and consequences. Fortschr Röntgenstr 2024; DOI 10.1055/a-2437-2067.
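For readers interested in how multiple-choice questions can be scored via the OpenAI API, the sketch below shows one minimal approach using the OpenAI Python client (v1.x): each question is sent as a prompt, the returned letter is compared with the key, and overall accuracy is computed. The model name, question format, prompt wording, and answer parsing are illustrative assumptions, not the authors' actual pipeline.

```python
# Minimal sketch (not the study's pipeline): send multiple-choice questions to an
# OpenAI chat model and compute overall accuracy. Assumes the `openai` Python
# client (v1.x) and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

# Hypothetical question format: stem, lettered options, and the correct letter.
questions = [
    {
        "stem": "Which modality is first-line for suspected acute appendicitis in adults?",
        "options": {"A": "MRI", "B": "CT", "C": "Ultrasound", "D": "Plain radiograph"},
        "answer": "B",
    },
    # ... remaining questions would follow the same structure
]

def ask_model(question: dict, model: str = "gpt-4") -> str:
    """Send one multiple-choice question and return the letter the model picks."""
    options_text = "\n".join(f"{k}) {v}" for k, v in question["options"].items())
    prompt = (
        f"{question['stem']}\n{options_text}\n"
        "Answer with the letter of the single best option only."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    reply = response.choices[0].message.content.strip()
    return reply[0].upper()  # naive parsing: take the leading letter

correct = sum(ask_model(q) == q["answer"] for q in questions)
print(f"Overall accuracy: {correct / len(questions):.1%}")
```

In practice, answer parsing and prompt wording would need more care (models sometimes reply with explanations rather than a bare letter), which is one reason the study also evaluated manually entered responses for Perplexity AI.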
