Department of Zoology, Aligarh Muslim University, Aligarh, India.
School of Computing and Informatics, The University of Louisiana, Lafayette, LA, United States.
JMIR Med Educ. 2024 Feb 21;10:e51523. doi: 10.2196/51523.
Large language models (LLMs) have revolutionized natural language processing with their ability to generate human-like text through extensive training on large data sets. These models, including Generative Pre-trained Transformers (GPT)-3.5 (OpenAI), GPT-4 (OpenAI), and Bard (Google LLC), find applications beyond natural language processing, attracting interest from academia and industry. Students are actively leveraging LLMs to enhance learning experiences and prepare for high-stakes exams, such as the National Eligibility cum Entrance Test (NEET) in India.
This comparative analysis aims to evaluate the performance of GPT-3.5, GPT-4, and Bard in answering NEET-2023 questions.
In this paper, we evaluated the performance of 3 mainstream LLMs, namely GPT-3.5, GPT-4, and Google Bard, in answering questions from the NEET-2023 exam. The NEET questions were presented to these artificial intelligence models, and their responses were recorded and compared against the correct answers from the official answer key. An accuracy consensus metric was used to evaluate the agreement and correctness of all 3 models.
GPT-4 passed the entrance test with flying colors (300/700, 42.9%), showcasing exceptional performance. GPT-3.5 also met the qualifying criteria, but with a substantially lower score (145/700, 20.7%). Bard (115/700, 16.4%), however, failed to meet the qualifying criteria and did not pass the test. GPT-4 demonstrated consistent superiority over Bard and GPT-3.5 in all 3 subjects. Specifically, GPT-4 achieved accuracy rates of 73% (29/40) in physics, 44% (16/36) in chemistry, and 51% (50/99) in biology. By comparison, GPT-3.5 attained accuracy rates of 45% (18/40) in physics, 33% (13/26) in chemistry, and 34% (34/99) in biology. The accuracy consensus metric showed that matching responses between GPT-4 and Bard, and between GPT-4 and GPT-3.5, were correct more often, at 0.56 and 0.57, respectively, than matching responses between Bard and GPT-3.5, which stood at 0.42. When all 3 models were considered together, their matching responses reached the highest accuracy consensus of 0.59.
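The accuracy consensus reported above can be understood as follows: among the questions where a pair (or trio) of models gave the same answer, what fraction of those shared answers matched the official key? A minimal sketch, assuming answers are stored as per-question option labels (the function name and data layout here are illustrative, not taken from the paper):

```python
def accuracy_consensus(answer_key, *model_answers):
    """Fraction of questions with matching model responses that are
    also correct per the official answer key.

    answer_key: list of correct option labels, one per question.
    model_answers: one list of option labels per model, same length.
    """
    # Indices where every supplied model gave the same answer.
    matching = [
        i for i in range(len(answer_key))
        if len({answers[i] for answers in model_answers}) == 1
    ]
    if not matching:
        return 0.0
    # Among those, count the ones that agree with the key.
    correct = sum(1 for i in matching if model_answers[0][i] == answer_key[i])
    return correct / len(matching)
```

For example, if two models agree on 3 of 4 questions and 2 of those shared answers are correct, the consensus is 2/3 ≈ 0.67, regardless of how either model performed on the questions where they disagreed.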
The study's findings provide valuable insights into the performance of GPT-3.5, GPT-4, and Bard in answering NEET-2023 questions. GPT-4 emerged as the most accurate model, highlighting its potential for educational applications. Cross-checking responses across models may nonetheless mislead students, as the compared models (as duos or a trio) produce matching responses that are correct only a little over half of the time. Including GPT-4 among the compared models yields a higher accuracy consensus. The results underscore the suitability of LLMs for high-stakes exams and their positive impact on education. Additionally, the study establishes a benchmark for evaluating and enhancing LLMs' performance in educational tasks, promoting responsible and informed use of these models in diverse learning environments.