
Human versus Artificial Intelligence: ChatGPT-4 Outperforming Bing, Bard, ChatGPT-3.5 and Humans in Clinical Chemistry Multiple-Choice Questions.

Author Information

Sallam Malik, Al-Salahat Khaled, Eid Huda, Egger Jan, Puladi Behrus

Affiliations

Department of Pathology, Microbiology and Forensic Medicine, School of Medicine, The University of Jordan, Amman, Jordan.

Department of Clinical Laboratories and Forensic Medicine, Jordan University Hospital, Amman, Jordan.

Publication Information

Adv Med Educ Pract. 2024 Sep 20;15:857-871. doi: 10.2147/AMEP.S479801. eCollection 2024.

DOI: 10.2147/AMEP.S479801
PMID: 39319062
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11421444/
Abstract

INTRODUCTION

Artificial intelligence (AI) chatbots excel in language understanding and generation. These models can transform healthcare education and practice. However, it is important to assess the performance of such AI models across various topics to highlight their strengths and possible limitations. This study aimed to evaluate the performance of ChatGPT (GPT-3.5 and GPT-4), Bing, and Bard compared with human students at a postgraduate master's level in Medical Laboratory Sciences.

METHODS

The study design was based on the METRICS checklist for the design and reporting of AI-based studies in healthcare. The study utilized a dataset of 60 Clinical Chemistry multiple-choice questions (MCQs) initially conceived for assessing 20 MSc students. The revised Bloom's taxonomy was used as the framework for classifying the MCQs into four cognitive categories: Remember, Understand, Analyze, and Apply. A modified version of the CLEAR tool was used for the assessment of the quality of AI-generated content, with Cohen's κ for inter-rater agreement.
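The inter-rater agreement statistic named above, Cohen's κ, corrects raw agreement for chance. A minimal pure-Python sketch follows; the rating data are hypothetical illustrations, not the study's actual CLEAR ratings.

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    n = len(rater_a)
    # Observed agreement: fraction of items both raters labeled identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement under independence, from each rater's label frequencies.
    ca, cb = Counter(rater_a), Counter(rater_b)
    p_e = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical CLEAR-style ratings of four AI answers by two raters.
a = ["Excellent", "Excellent", "Average", "Average"]
b = ["Excellent", "Average", "Average", "Average"]
print(cohen_kappa(a, b))  # 0.5
```

Here the raters agree on 3 of 4 items (p_o = 0.75) while chance alone predicts 0.5 agreement, giving κ = 0.5.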

RESULTS

Compared to the mean students' score of 0.68±0.23, GPT-4 scored 0.90±0.30, followed by Bing (0.77±0.43), GPT-3.5 (0.73±0.45), and Bard (0.67±0.48). Statistically significantly better performance was noted in the lower cognitive domains (Remember and Understand) for GPT-3.5 (P=0.041), GPT-4 (P=0.003), and Bard (P=0.017) compared to the higher cognitive domains (Apply and Analyze). The CLEAR scores indicated that ChatGPT-4's performance was "Excellent" compared to the "Above average" performance of ChatGPT-3.5, Bing, and Bard.
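The mean ± SD figures above are consistent with per-question binary (0/1) scoring over the 60 MCQs, since for 0/1 data the population SD is sqrt(p·(1−p)). As a sketch, a hypothetical score vector reproducing GPT-4's reported 0.90±0.30 (the item-level data are not given in the abstract):

```python
import statistics

# Hypothetical binary scores: 54 of 60 MCQs correct, matching the reported 0.90.
gpt4_scores = [1] * 54 + [0] * 6

mean = statistics.mean(gpt4_scores)
sd = statistics.pstdev(gpt4_scores)  # population SD; sqrt(p*(1-p)) for 0/1 data
print(f"{mean:.2f} ± {sd:.2f}")      # 0.90 ± 0.30
```

The same check works for the other models, e.g. 0.77 gives sqrt(0.77·0.23) ≈ 0.42, close to the reported 0.43.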

DISCUSSION

The findings indicated that ChatGPT-4 excelled in the Clinical Chemistry exam, while ChatGPT-3.5, Bing, and Bard were above average. Given that the MCQs were directed to postgraduate students with a high degree of specialization, the performance of these AI chatbots was remarkable. Due to the risk of academic dishonesty and possible dependence on these AI models, the appropriateness of MCQs as an assessment tool in higher education should be re-evaluated.


Figure: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d7e6/11421444/09b287cef5b7/AMEP-15-857-g0001.jpg
