
A pilot study of the performance of Chat GPT and other large language models on a written final year periodontology exam.

Author information

Ramlogan Shaun, Raman Vidya, Ramlogan Shayn

Affiliations

School of Dentistry, Faculty of Medical Sciences, The University of the West Indies, St Augustine Campus, EWMSC, Champs Fleurs, Trinidad and Tobago.

School of Medicine, Faculty of Medical Sciences, The University of the West Indies, St Augustine Campus, EWMSC, Champs Fleurs, Trinidad and Tobago.

Publication information

BMC Med Educ. 2025 May 19;25(1):727. doi: 10.1186/s12909-025-07195-7.

Abstract

Large Language Models (LLMs) such as Chat GPT are increasingly used by students in education and reportedly produce adequate academic responses. Chat GPT is expected to learn and improve over time. The aim was therefore to longitudinally compare the performance of the current versions of Chat GPT-4/GPT-4o with that of final-year DDS students on a written periodontology exam. Other current non-subscription LLMs were also compared with the students. Chat GPT-4, guided by the exam parameters, generated answers as 'Run 1' and, 6 months later, as 'Run 2'. Chat GPT-4o generated answers as 'Run 3' 15 months later. All LLM and student scripts were marked independently by two periodontology lecturers (Cohen's Kappa 0.71). 'Run 1' and 'Run 3' achieved statistically significantly (p < 0.001) higher mean scores of 78% and 77% compared to the students (60%). The mean scores of Chat GPT-4 and GPT-4o were also similar to that of the best student. 'Run 2' performed at the level of the students but underperformed relative to 'Run 1' and 'Run 3', with more generalizations, inaccuracies, and incomplete answers. This variability in 'Run 2' may be due to outdated data sources, hallucinations, and inherent LLM limitations such as online traffic and the availability of datasets and resources. Other non-subscription LLMs such as Claude, DeepSeek, Gemini, and Le Chat also produced statistically significantly (p < 0.001) higher scores than the students. Claude was the best-performing LLM, with more comprehensive answers. LLMs such as Chat GPT may provide summaries and model answers in undergraduate clinical periodontology education. However, the results must be interpreted with caution regarding academic accuracy and credibility, especially in a health care profession.


Fig. 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a4a8/12090576/d29c3f2399c4/12909_2025_7195_Fig1_HTML.jpg
