


Reliability of ChatGPT in automated essay scoring for dental undergraduate examinations.

Affiliations

Faculty of Dentistry, National University of Singapore, Singapore, Singapore.

Discipline of Oral and Maxillofacial Surgery, National University Centre for Oral Health, 9 Lower Kent Ridge Road, Singapore, Singapore.

Publication

BMC Med Educ. 2024 Sep 3;24(1):962. doi: 10.1186/s12909-024-05881-6.

DOI:10.1186/s12909-024-05881-6
PMID:39227811
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11373238/
Abstract

BACKGROUND

This study aimed to answer the research question: How reliable is ChatGPT in automated essay scoring (AES) for oral and maxillofacial surgery (OMS) examinations for dental undergraduate students compared to human assessors?

METHODS

Sixty-nine undergraduate dental students participated in a closed-book examination comprising two essays at the National University of Singapore. Using pre-created assessment rubrics, three assessors independently performed manual essay scoring, while one separate assessor performed AES using ChatGPT (GPT-4). Data analyses were performed using the intraclass correlation coefficient and Cronbach's α to evaluate the reliability and inter-rater agreement of the test scores among all assessors. The mean scores of manual versus automated scoring were evaluated for similarity and correlations.
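The reliability statistics named in the Methods can be illustrated with a minimal sketch. The scores below are invented for illustration only, not the study's data, and `cronbach_alpha` is a hypothetical helper, not anything from the paper:

```python
# Minimal sketch (invented scores, not the study's data): Cronbach's alpha
# across raters, and the Pearson correlation between the mean manual score
# and the AES score, the two agreement measures described in the Methods.
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """scores: rows = students, columns = raters."""
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1).sum()  # sum of per-rater variances
    total_var = scores.sum(axis=1).var(ddof=1)    # variance of summed scores
    return (k / (k - 1)) * (1 - item_vars / total_var)

# Hypothetical essay scores: 6 students x 3 human raters, plus an AES column.
manual = np.array([[14, 15, 13],
                   [10,  9, 11],
                   [18, 17, 18],
                   [ 7,  8,  6],
                   [12, 13, 12],
                   [16, 15, 17]], dtype=float)
aes = np.array([13, 10, 17, 7, 12, 16], dtype=float)

alpha = cronbach_alpha(manual)
r = float(np.corrcoef(manual.mean(axis=1), aes)[0, 1])
print(f"alpha = {alpha:.3f}, r = {r:.3f}")
```

With raters this consistent, α and r both come out close to 1, which is the pattern the study reports for Question 1.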

RESULTS

A strong correlation was observed for Question 1 (r = 0.752-0.848, p < 0.001) and a moderate correlation was observed between AES and all manual scorers for Question 2 (r = 0.527-0.571, p < 0.001). Intraclass correlation coefficients of 0.794-0.858 indicated excellent inter-rater agreement, and Cronbach's α of 0.881-0.932 indicated high reliability. For Question 1, the mean AES scores were similar to those for manual scoring (p > 0.05), and there was a strong correlation between AES and manual scores (r = 0.829, p < 0.001). For Question 2, AES scores were significantly lower than manual scores (p < 0.001), and there was a moderate correlation between AES and manual scores (r = 0.599, p < 0.001).
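The intraclass correlation coefficients reported above can likewise be sketched. The paper does not state which ICC form it used; the sketch below assumes the common two-way random-effects, single-rater form ICC(2,1), computed from the classic ANOVA decomposition, with invented scores and a hypothetical helper name:

```python
# Minimal sketch (invented scores): ICC(2,1), two-way random effects,
# absolute agreement, single rater. The study reports ICCs of 0.794-0.858;
# this helper and its data are illustrative only.
import numpy as np

def icc2_1(scores: np.ndarray) -> float:
    """scores: rows = subjects (students), columns = raters."""
    n, k = scores.shape
    grand = scores.mean()
    msr = k * scores.mean(axis=1).var(ddof=0) * n / (n - 1)  # between-subjects
    msc = n * scores.mean(axis=0).var(ddof=0) * k / (k - 1)  # between-raters
    resid = (scores - scores.mean(axis=1, keepdims=True)
                    - scores.mean(axis=0, keepdims=True) + grand)
    mse = (resid ** 2).sum() / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Hypothetical scores: 6 students x 4 raters (3 human + 1 AES).
scores = np.array([[14, 15, 13, 13],
                   [10,  9, 11, 10],
                   [18, 17, 18, 17],
                   [ 7,  8,  6,  7],
                   [12, 13, 12, 12],
                   [16, 15, 17, 16]], dtype=float)

icc = icc2_1(scores)
print(f"ICC(2,1) = {icc:.3f}")
```

Because between-student variance dwarfs rater disagreement in this toy data, the ICC lands near 1, i.e. "excellent" inter-rater agreement on the usual interpretive scale.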

CONCLUSION

This study shows the potential of ChatGPT for essay marking. However, appropriate rubric design is essential for optimal reliability. With further validation, ChatGPT has the potential to aid students in self-assessment or to support large-scale automated marking processes.


Figures:
Fig 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a136/11373238/56a176b62da2/12909_2024_5881_Fig1_HTML.jpg
Fig 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a136/11373238/e604757fd7f2/12909_2024_5881_Fig2_HTML.jpg
Fig 3: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a136/11373238/fd555eae6b62/12909_2024_5881_Fig3_HTML.jpg

Similar articles

1
Reliability of ChatGPT in automated essay scoring for dental undergraduate examinations.
BMC Med Educ. 2024 Sep 3;24(1):962. doi: 10.1186/s12909-024-05881-6.
2
Triple jump examinations for dental student assessment.
J Dent Educ. 2013 Oct;77(10):1315-20.
3
Automated essay scoring (AES) of constructed responses in nursing examinations: An evaluation.
Nurse Educ Pract. 2021 Jul;54:103085. doi: 10.1016/j.nepr.2021.103085. Epub 2021 May 24.
4
A Multi-institutional Study of the Feasibility and Reliability of the Implementation of Constructed Response Exam Questions.
Teach Learn Med. 2023 Oct-Dec;35(5):609-622. doi: 10.1080/10401334.2022.2111571. Epub 2022 Aug 20.
5
Development and Validation of a Tool to Evaluate the Evolution of Clinical Reasoning in Trauma Using Virtual Patients.
J Surg Educ. 2018 May-Jun;75(3):779-786. doi: 10.1016/j.jsurg.2017.08.024. Epub 2017 Sep 18.
6
The Revival of Essay-Type Questions in Medical Education: Harnessing Artificial Intelligence and Machine Learning.
J Coll Physicians Surg Pak. 2024 May;34(5):595-599. doi: 10.29271/jcpsp.2024.05.595.
7
Evaluation of ChatGPT's Real-Life Implementation in Undergraduate Dental Education: Mixed Methods Study.
JMIR Med Educ. 2024 Jan 31;10:e51344. doi: 10.2196/51344.
8
Validation of undergraduate medical student script concordance test (SCT) scores on the clinical assessment of the acute abdomen.
BMC Surg. 2016 Aug 17;16(1):57. doi: 10.1186/s12893-016-0173-y.
9
ChatGPT-A double-edged sword for healthcare education? Implications for assessments of dental students.
Eur J Dent Educ. 2024 Feb;28(1):206-211. doi: 10.1111/eje.12937. Epub 2023 Aug 7.
10
Norming a VALUE rubric to assess graduate information literacy skills.
J Med Libr Assoc. 2016 Jul;104(3):209-14. doi: 10.3163/1536-5050.104.3.005.

Cited by

1
Large language models underperform in European general surgery board examinations: a comparative study with experts and surgical residents.
BMC Med Educ. 2025 Aug 23;25(1):1193. doi: 10.1186/s12909-025-07856-7.
2
Chat Generative Pre-Trained Transformer (ChatGPT) in Oral and Maxillofacial Surgery: A Narrative Review on Its Research Applications and Limitations.
J Clin Med. 2025 Feb 18;14(4):1363. doi: 10.3390/jcm14041363.
3
Using ChatGPT for medical education: the technical perspective.
BMC Med Educ. 2025 Feb 7;25(1):201. doi: 10.1186/s12909-025-06785-9.

References

1
Prompt Engineering for Nurse Educators.
Nurse Educ. 2024;49(6):293-299. doi: 10.1097/NNE.0000000000001705. Epub 2024 Jul 5.
2
Optimizing Individual Wound Closure Practice Using Augmented Reality: A Randomized Controlled Study.
Cureus. 2024 Apr 29;16(4):e59296. doi: 10.7759/cureus.59296. eCollection 2024 Apr.
3
Simulation to become a better neurosurgeon. An international prospective controlled trial: The Passion study.
Brain Spine. 2024 May 11;4:102829. doi: 10.1016/j.bas.2024.102829. eCollection 2024.
4
ChatGPT applications in medical, dental, pharmacy, and public health education: A descriptive study highlighting the advantages and limitations.
Narra J. 2023 Apr;3(1):e103. doi: 10.52225/narra.v3i1.103. Epub 2023 Mar 29.
5
Validity and reliability of artificial intelligence chatbots as public sources of information on endodontics.
Int Endod J. 2024 Mar;57(3):305-314. doi: 10.1111/iej.14014. Epub 2023 Dec 20.
6
Using AI to Improve Radiologist Performance in Detection of Abnormalities on Chest Radiographs.
Radiology. 2023 Dec;309(3):e230860. doi: 10.1148/radiol.230860.
7
"Tell me what is 'better'!" How medical students experience feedback, through the lens of self-regulatory learning.
BMC Med Educ. 2023 Nov 22;23(1):895. doi: 10.1186/s12909-023-04842-9.
8
Assessment of the capacity of ChatGPT as a self-learning tool in medical pharmacology: a study using MCQs.
BMC Med Educ. 2023 Nov 13;23(1):864. doi: 10.1186/s12909-023-04832-x.
9
A Comprehensive Survey of ChatGPT: Advancements, Applications, Prospects, and Challenges.
Meta Radiol. 2023 Sep;1(2). doi: 10.1016/j.metrad.2023.100022. Epub 2023 Oct 7.
10
Assessment of landmark detection in cephalometric radiographs with different conditions of brightness and contrast using artificial intelligence software.
Dentomaxillofac Radiol. 2023 Nov;52(8):20230065. doi: 10.1259/dmfr.20230065. Epub 2023 Oct 23.