
A pilot study of the performance of Chat GPT and other large language models on a written final year periodontology exam.

Author information

Ramlogan Shaun, Raman Vidya, Ramlogan Shayn

Affiliations

School of Dentistry, Faculty of Medical Sciences, The University of the West Indies, St Augustine Campus, EWMSC, Champs Fleurs, Trinidad and Tobago.

School of Medicine, Faculty of Medical Sciences, The University of the West Indies, St Augustine Campus, EWMSC, Champs Fleurs, Trinidad and Tobago.

Publication information

BMC Med Educ. 2025 May 19;25(1):727. doi: 10.1186/s12909-025-07195-7.

Abstract

Large Language Models (LLMs) such as Chat GPT are increasingly used by students in education and reportedly produce adequate academic responses. Chat GPT is expected to learn and improve over time. The aim was therefore to longitudinally compare the performance of the current versions of Chat GPT-4/GPT-4o with that of final-year DDS students on a written periodontology exam. Other current non-subscription LLMs were also compared with the students. Chat GPT-4, guided by the exam parameters, generated answers as 'Run 1' and, 6 months later, as 'Run 2'. Chat GPT-4o generated answers as 'Run 3' 15 months later. All LLM and student scripts were marked independently by two periodontology lecturers (Cohen's Kappa 0.71). 'Run 1' and 'Run 3' achieved statistically significantly (p < 0.001) higher mean scores of 78% and 77% compared to the students (60%). The mean scores of Chat GPT-4 and GPT-4o were also similar to that of the best student. 'Run 2' performed at the level of the students but underperformed relative to 'Run 1' and 'Run 3', with more generalizations, inaccuracies, and incomplete answers. This variability in 'Run 2' may be due to outdated data sources, hallucinations, and inherent LLM limitations such as online traffic and the availability of datasets and resources. Other non-subscription LLMs such as Claude, DeepSeek, Gemini, and Le Chat also produced statistically significantly (p < 0.001) higher scores than the students. Claude was the best-performing LLM, with more comprehensive answers. LLMs such as Chat GPT may provide summaries and model answers in undergraduate clinical periodontology education. However, the results must be interpreted with caution regarding academic accuracy and credibility, especially in a health care profession.


Fig. 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a4a8/12090576/d29c3f2399c4/12909_2025_7195_Fig1_HTML.jpg
