Department of Molecular Genetics and Microbiology, University of Florida, Gainesville, FL, 32610, USA.
UF Genetics Institute, University of Florida, Gainesville, FL, 32610, USA.
Sci Rep. 2024 Mar 7;14(1):5670. doi: 10.1038/s41598-024-55568-7.
The GPT-4 large language model (LLM) and ChatGPT chatbot have emerged as accessible and capable tools for generating English-language text in a variety of formats. GPT-4 has previously performed well when applied to questions from multiple standardized examinations. However, further evaluation of the trustworthiness and accuracy of GPT-4 responses across various knowledge domains is essential before its use as a reference resource. Here, we assess GPT-4 performance on nine graduate-level examinations in the biomedical sciences (seven blinded), finding that GPT-4 scores exceed the student average in seven of nine cases and exceed all student scores for four exams. GPT-4 performed very well on fill-in-the-blank, short-answer, and essay questions, and correctly answered several questions on figures sourced from published manuscripts. Conversely, GPT-4 performed poorly on questions with figures containing simulated data and those requiring a hand-drawn answer. Two GPT-4 answer sets were flagged as plagiarism based on answer similarity, and some model responses included detailed hallucinations. In addition to assessing GPT-4 performance, we discuss patterns and limitations in GPT-4 capabilities with the goal of informing the design of future academic examinations in the chatbot era.