Department of Zoology, Aligarh Muslim University, Aligarh, India.
School of Computing and Informatics, The University of Louisiana, Lafayette, LA, United States.
JMIR Med Educ. 2024 Feb 21;10:e51523. doi: 10.2196/51523.
Large language models (LLMs) have revolutionized natural language processing with their ability to generate human-like text through extensive training on large data sets. These models, including Generative Pre-trained Transformers (GPT)-3.5 (OpenAI), GPT-4 (OpenAI), and Bard (Google LLC), find applications beyond natural language processing, attracting interest from academia and industry. Students are actively leveraging LLMs to enhance learning experiences and prepare for high-stakes exams, such as the National Eligibility cum Entrance Test (NEET) in India.
This comparative analysis aims to evaluate the performance of GPT-3.5, GPT-4, and Bard in answering NEET-2023 questions.
In this paper, we evaluated the performance of 3 mainstream LLMs, namely GPT-3.5, GPT-4, and Google Bard, in answering questions from the NEET-2023 exam. The NEET questions were presented to these artificial intelligence models, and their responses were recorded and compared against the correct answers from the official answer key. An accuracy consensus metric was used to evaluate the agreement and correctness of all 3 models.
GPT-4 passed the entrance test with flying colors (300/700, 42.9%), showcasing exceptional performance. GPT-3.5 also met the qualifying criteria, but with a substantially lower score (145/700, 20.7%). Bard (115/700, 16.4%), however, failed to meet the qualifying criteria and did not pass the test. GPT-4 demonstrated consistent superiority over Bard and GPT-3.5 in all 3 subjects. Specifically, GPT-4 achieved accuracy rates of 73% (29/40) in physics, 44% (16/36) in chemistry, and 51% (50/99) in biology. By comparison, GPT-3.5 attained accuracy rates of 45% (18/40) in physics, 33% (13/26) in chemistry, and 34% (34/99) in biology. The accuracy consensus metric showed that matching responses between GPT-4 and Bard, and between GPT-4 and GPT-3.5, were correct more often, at 0.56 and 0.57, respectively, than matching responses between Bard and GPT-3.5, which stood at 0.42. When all 3 models were considered together, their matching responses reached the highest accuracy consensus of 0.59.
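The accuracy consensus reported above can be understood as follows: among the questions where a pair (or trio) of models gave the same answer, what fraction of those shared answers matched the official key? A minimal sketch, assuming answers are stored as per-question option labels (the function name and data layout here are illustrative, not taken from the paper):

```python
def accuracy_consensus(answer_key, *model_answers):
    """Fraction of questions with matching model responses that are
    also correct per the official answer key.

    answer_key: list of correct option labels, one per question.
    model_answers: one list of option labels per model, same length.
    """
    # Indices where every supplied model gave the same answer.
    matching = [
        i for i in range(len(answer_key))
        if len({answers[i] for answers in model_answers}) == 1
    ]
    if not matching:
        return 0.0
    # Among those, count the ones that agree with the key.
    correct = sum(1 for i in matching if model_answers[0][i] == answer_key[i])
    return correct / len(matching)
```

For example, if two models agree on 3 of 4 questions and 2 of those shared answers are correct, the consensus is 2/3 ≈ 0.67, regardless of how either model performed on the questions where they disagreed.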
The study's findings provide valuable insights into the performance of GPT-3.5, GPT-4, and Bard in answering NEET-2023 questions. GPT-4 emerged as the most accurate model, highlighting its potential for educational applications. Cross-checking responses across models may nonetheless mislead students, as the compared models (as duos or a trio) produce matching responses that are correct only a little over half of the time. Including GPT-4 among the compared models yields a higher accuracy consensus. The results underscore the suitability of LLMs for high-stakes exams and their positive impact on education. Additionally, the study establishes a benchmark for evaluating and enhancing LLMs' performance in educational tasks, promoting responsible and informed use of these models in diverse learning environments.