

Large Language Models in Biochemistry Education: Comparative Evaluation of Performance

Authors

Bolgova Olena, Shypilova Inna, Mavrych Volodymyr

Affiliations

College of Medicine, Alfaisal University, Al Takhassousi St, Riyadh, 11533, Saudi Arabia.

School of Medicine, St Mathews University, George Town, Cayman Islands.

Publication

JMIR Med Educ. 2025 Apr 10;11:e67244. doi: 10.2196/67244.

DOI:10.2196/67244
PMID:40209205
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12005600/
Abstract

BACKGROUND

Recent advancements in artificial intelligence (AI), particularly in large language models (LLMs), have ushered in a new era of innovation across various fields, with medicine at the forefront of this technological revolution. Many studies have indicated that, at their current level of development, LLMs can pass various board examinations. However, their ability to answer specific subject-related questions still requires validation.

OBJECTIVE

The objective of this study was to conduct a comprehensive analysis comparing the performance of advanced LLM chatbots (Claude [Anthropic], GPT-4 [OpenAI], Gemini [Google], and Copilot [Microsoft]) against the academic results of medical students in the medical biochemistry course.

METHODS

We used 200 USMLE (United States Medical Licensing Examination)-style multiple-choice questions (MCQs) selected from the course exam database. They encompassed various complexity levels and were distributed across 23 distinctive topics. The questions with tables and images were not included in the study. The results of 5 successive attempts by Claude 3.5 Sonnet, GPT-4-1106, Gemini 1.5 Flash, and Copilot to answer this questionnaire set were evaluated based on accuracy in August 2024. Statistica 13.5.0.17 (TIBCO Software Inc) was used to analyze the data's basic statistics. Considering the binary nature of the data, the chi-square test was used to compare results among the different chatbots, with a statistical significance level of P<.05.
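The chi-square comparison described above can be sketched in a few lines. As an illustration (our own 2x2 layout, not the authors' Statistica workflow), pair each model's correct/incorrect counts from the Results, e.g. Claude (185/200 correct) versus Copilot (128/200 correct):

```python
from math import erfc, sqrt

def chi2_2x2(a: int, b: int, c: int, d: int) -> tuple[float, float]:
    """Pearson chi-square for the 2x2 table [[a, b], [c, d]], no Yates correction.

    Returns (statistic, p_value). With 1 degree of freedom, the chi-square
    distribution is the square of a standard normal, so the survival function
    reduces to erfc(sqrt(stat / 2)).
    """
    n = a + b + c + d
    # Closed-form 2x2 statistic: N * (ad - bc)^2 / (row and column totals)
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p = erfc(sqrt(stat / 2))
    return stat, p

# Claude: 185 correct / 15 incorrect; Copilot: 128 correct / 72 incorrect
stat, p = chi2_2x2(185, 15, 128, 72)
print(f"chi2 = {stat:.2f}, p = {p:.2e}")  # p is well below the .05 threshold
```

For a 2x2 table the closed-form statistic avoids building an expected-count matrix; for larger contingency tables (e.g. all 4 chatbots at once) a general routine such as `scipy.stats.chi2_contingency` would be the usual choice.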

RESULTS

On average, the selected chatbots correctly answered 81.1% (SD 12.8%) of the questions, surpassing the students' performance by 8.3% (P=.02). In this study, Claude showed the best performance in biochemistry MCQs, correctly answering 92.5% (185/200) of questions, followed by GPT-4 (170/200, 85%), Gemini (157/200, 78.5%), and Copilot (128/200, 64%). The chatbots demonstrated the best results in the following 4 topics: eicosanoids (mean 100%, SD 0%), bioenergetics and electron transport chain (mean 96.4%, SD 7.2%), hexose monophosphate pathway (mean 91.7%, SD 16.7%), and ketone bodies (mean 93.8%, SD 12.5%). The Pearson chi-square test indicated a statistically significant association between the answers of all 4 chatbots (P<.001 to P<.04).

CONCLUSIONS

Our study suggests that different AI models may have unique strengths in specific medical fields, which could be leveraged for targeted support in biochemistry courses. This performance highlights the potential of AI in medical education and assessment.


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2d37/12005600/2b45f86fcbe0/mededu-v11-e67244-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2d37/12005600/292241ece887/mededu-v11-e67244-g002.jpg

Similar Articles

1. Large Language Models in Biochemistry Education: Comparative Evaluation of Performance. JMIR Med Educ. 2025 Apr 10;11:e67244. doi: 10.2196/67244.
2. Claude, ChatGPT, Copilot, and Gemini performance versus students in different topics of neuroscience. Adv Physiol Educ. 2025 Jun 1;49(2):430-437. doi: 10.1152/advan.00093.2024. Epub 2025 Jan 17.
3. Comparative analysis of LLMs performance in medical embryology: A cross-platform study of ChatGPT, Claude, Gemini, and Copilot. Anat Sci Educ. 2025 May 11. doi: 10.1002/ase.70044.
4. Using large language models (ChatGPT, Copilot, PaLM, Bard, and Gemini) in Gross Anatomy course: Comparative analysis. Clin Anat. 2025 Mar;38(2):200-210. doi: 10.1002/ca.24244. Epub 2024 Nov 21.
5. Accuracy of latest large language models in answering multiple choice questions in dentistry: A comparative study. PLoS One. 2025 Jan 29;20(1):e0317423. doi: 10.1371/journal.pone.0317423. eCollection 2025.
6. Benchmarking LLM chatbots' oncological knowledge with the Turkish Society of Medical Oncology's annual board examination questions. BMC Cancer. 2025 Feb 4;25(1):197. doi: 10.1186/s12885-025-13596-0.
7. Comparative accuracy of ChatGPT-4, Microsoft Copilot and Google Gemini in the Italian entrance test for healthcare sciences degrees: a cross-sectional study. BMC Med Educ. 2024 Jun 26;24(1):694. doi: 10.1186/s12909-024-05630-9.
8. Evaluating Bard Gemini Pro and GPT-4 Vision Against Student Performance in Medical Visual Question Answering: Comparative Case Study. JMIR Form Res. 2024 Dec 17;8:e57592. doi: 10.2196/57592.
9. Benchmarking Vision Capabilities of Large Language Models in Surgical Examination Questions. J Surg Educ. 2025 Apr;82(4):103442. doi: 10.1016/j.jsurg.2025.103442. Epub 2025 Feb 9.
10. Evaluating Artificial Intelligence Chatbots in Oral and Maxillofacial Surgery Board Exams: Performance and Potential. J Oral Maxillofac Surg. 2025 Mar;83(3):382-389. doi: 10.1016/j.joms.2024.11.007. Epub 2024 Nov 19.

Cited By

1. Evaluating large language models as graders of medical short answer questions: a comparative analysis with expert human graders. Med Educ Online. 2025 Dec;30(1):2550751. doi: 10.1080/10872981.2025.2550751. Epub 2025 Aug 24.
2. Large language models in medical education: a comparative cross-platform evaluation in answering histological questions. Med Educ Online. 2025 Dec;30(1):2534065. doi: 10.1080/10872981.2025.2534065. Epub 2025 Jul 12.
