The model student: GPT-4 performance on graduate biomedical science exams.

Affiliations

Department of Molecular Genetics and Microbiology, University of Florida, Gainesville, FL, 32610, USA.

UF Genetics Institute, University of Florida, Gainesville, FL, 32610, USA.

Publication information

Sci Rep. 2024 Mar 7;14(1):5670. doi: 10.1038/s41598-024-55568-7.

PMID: 38453979
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10920673/
Abstract

The GPT-4 large language model (LLM) and ChatGPT chatbot have emerged as accessible and capable tools for generating English-language text in a variety of formats. GPT-4 has previously performed well when applied to questions from multiple standardized examinations. However, further evaluation of trustworthiness and accuracy of GPT-4 responses across various knowledge domains is essential before its use as a reference resource. Here, we assess GPT-4 performance on nine graduate-level examinations in the biomedical sciences (seven blinded), finding that GPT-4 scores exceed the student average in seven of nine cases and exceed all student scores for four exams. GPT-4 performed very well on fill-in-the-blank, short-answer, and essay questions, and correctly answered several questions on figures sourced from published manuscripts. Conversely, GPT-4 performed poorly on questions with figures containing simulated data and those requiring a hand-drawn answer. Two GPT-4 answer-sets were flagged as plagiarism based on answer similarity and some model responses included detailed hallucinations. In addition to assessing GPT-4 performance, we discuss patterns and limitations in GPT-4 capabilities with the goal of informing design of future academic examinations in the chatbot era.

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8cc7/10920673/3613ca3ec82d/41598_2024_55568_Fig1_HTML.jpg
Figure 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8cc7/10920673/43940e62879c/41598_2024_55568_Fig2_HTML.jpg