The model student: GPT-4 performance on graduate biomedical science exams.

Affiliations

Department of Molecular Genetics and Microbiology, University of Florida, Gainesville, FL, 32610, USA.

UF Genetics Institute, University of Florida, Gainesville, FL, 32610, USA.

Publication information

Sci Rep. 2024 Mar 7;14(1):5670. doi: 10.1038/s41598-024-55568-7.

Abstract

The GPT-4 large language model (LLM) and ChatGPT chatbot have emerged as accessible and capable tools for generating English-language text in a variety of formats. GPT-4 has previously performed well when applied to questions from multiple standardized examinations. However, further evaluation of trustworthiness and accuracy of GPT-4 responses across various knowledge domains is essential before its use as a reference resource. Here, we assess GPT-4 performance on nine graduate-level examinations in the biomedical sciences (seven blinded), finding that GPT-4 scores exceed the student average in seven of nine cases and exceed all student scores for four exams. GPT-4 performed very well on fill-in-the-blank, short-answer, and essay questions, and correctly answered several questions on figures sourced from published manuscripts. Conversely, GPT-4 performed poorly on questions with figures containing simulated data and those requiring a hand-drawn answer. Two GPT-4 answer-sets were flagged as plagiarism based on answer similarity and some model responses included detailed hallucinations. In addition to assessing GPT-4 performance, we discuss patterns and limitations in GPT-4 capabilities with the goal of informing design of future academic examinations in the chatbot era.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8cc7/10920673/3613ca3ec82d/41598_2024_55568_Fig1_HTML.jpg
