Department of Molecular Genetics and Microbiology, University of Florida, Gainesville, FL, 32610, USA.
UF Genetics Institute, University of Florida, Gainesville, FL, 32610, USA.
Sci Rep. 2024 Mar 7;14(1):5670. doi: 10.1038/s41598-024-55568-7.
The GPT-4 large language model (LLM) and ChatGPT chatbot have emerged as accessible and capable tools for generating English-language text in a variety of formats. GPT-4 has previously performed well when applied to questions from multiple standardized examinations. However, further evaluation of the trustworthiness and accuracy of GPT-4 responses across various knowledge domains is essential before its use as a reference resource. Here, we assess GPT-4 performance on nine graduate-level examinations in the biomedical sciences (seven blinded), finding that GPT-4 scores exceed the student average in seven of nine cases and exceed all student scores for four exams. GPT-4 performed very well on fill-in-the-blank, short-answer, and essay questions, and correctly answered several questions on figures sourced from published manuscripts. Conversely, GPT-4 performed poorly on questions with figures containing simulated data and those requiring a hand-drawn answer. Two GPT-4 answer sets were flagged as plagiarism based on answer similarity, and some model responses included detailed hallucinations. In addition to assessing GPT-4 performance, we discuss patterns and limitations in GPT-4 capabilities with the goal of informing the design of future academic examinations in the chatbot era.