Faculty of Dentistry, National University of Singapore, Singapore, Singapore.
Discipline of Oral and Maxillofacial Surgery, National University Centre for Oral Health, 9 Lower Kent Ridge Road, Singapore, Singapore.
BMC Med Educ. 2024 Sep 3;24(1):962. doi: 10.1186/s12909-024-05881-6.
This study aimed to answer the research question: how reliable is ChatGPT, compared with human assessors, in automated essay scoring (AES) of oral and maxillofacial surgery (OMS) examinations for dental undergraduate students?
Sixty-nine undergraduate dental students at the National University of Singapore sat a closed-book examination comprising two essay questions. Using pre-created assessment rubrics, three assessors independently performed manual essay scoring, while a separate assessor performed AES using ChatGPT (GPT-4). Data were analysed using the intraclass correlation coefficient and Cronbach's α to evaluate the reliability and inter-rater agreement of the test scores among all assessors. Mean manual and automated scores were compared for similarity and correlation.
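The abstract does not reproduce the authors' AES workflow or prompts; the following is a minimal sketch, assuming a programmatic route to GPT-4 through the OpenAI Python client, of how an essay might be scored against a rubric. The rubric text, prompt wording, and function name are illustrative only.

```python
# Minimal sketch (an assumption, not the authors' workflow): scoring one essay
# against a rubric with GPT-4 via the OpenAI Python client. The study used
# ChatGPT (GPT-4) directly; this programmatic version is illustrative only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

RUBRIC = """Criterion 1 (0-5): ...
Criterion 2 (0-5): ...
"""  # the actual OMS rubrics are not reproduced here

def score_essay(essay_text: str) -> str:
    """Ask GPT-4 to mark an essay against the rubric and return its response."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "You are an examiner. Score the essay strictly against "
                        "the rubric and report a mark per criterion plus a total."},
            {"role": "user", "content": f"Rubric:\n{RUBRIC}\n\nEssay:\n{essay_text}"},
        ],
        temperature=0,  # deterministic scoring for reproducibility
    )
    return response.choices[0].message.content

# Example: print(score_essay("Student answer to Question 1 ..."))
```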
A strong correlation was observed between AES and all manual scorers for Question 1 (r = 0.752-0.848, p < 0.001) and a moderate correlation for Question 2 (r = 0.527-0.571, p < 0.001). Intraclass correlation coefficients of 0.794-0.858 indicated excellent inter-rater agreement, and Cronbach's α of 0.881-0.932 indicated high reliability. For Question 1, mean AES scores were similar to manual scores (p > 0.05), and there was a strong correlation between AES and manual scores (r = 0.829, p < 0.001). For Question 2, AES scores were significantly lower than manual scores (p < 0.001), and there was a moderate correlation between AES and manual scores (r = 0.599, p < 0.001).
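For readers who want to reproduce this style of analysis, the sketch below shows one way to compute the reported statistics in Python. It is an assumed analysis, not the authors' code: the score matrix is a random placeholder, the pingouin package is assumed for the intraclass correlation, and a paired t-test stands in for whatever mean-comparison test the study actually used.

```python
# Minimal sketch (assumed analysis, not the authors' code) of the statistics
# reported above: Cronbach's alpha, intraclass correlation (via the pingouin
# package), Pearson correlation, and a paired comparison of mean scores.
# `scores` is a placeholder matrix of 69 students x 4 assessors (A1-A3 manual, AES).
import numpy as np
import pandas as pd
import pingouin as pg
from scipy import stats

rng = np.random.default_rng(0)
base = rng.normal(14, 3, size=(69, 1))
scores = base + rng.normal(0, 1.5, size=(69, 4))  # correlated placeholder ratings

# Cronbach's alpha: k/(k-1) * (1 - sum of per-rater variances / variance of total)
k = scores.shape[1]
alpha = k / (k - 1) * (1 - scores.var(axis=0, ddof=1).sum()
                       / scores.sum(axis=1).var(ddof=1))

# Intraclass correlation coefficient (long format required by pingouin)
df = pd.DataFrame(scores, columns=["A1", "A2", "A3", "AES"])
df["student"] = np.arange(len(df))
long = df.melt(id_vars="student", var_name="assessor", value_name="score")
icc = pg.intraclass_corr(data=long, targets="student", raters="assessor", ratings="score")

# AES versus the mean manual score: correlation and paired mean comparison
manual_mean = scores[:, :3].mean(axis=1)
r, p_r = stats.pearsonr(scores[:, 3], manual_mean)
t, p_t = stats.ttest_rel(scores[:, 3], manual_mean)

print(f"Cronbach's alpha = {alpha:.3f}")
print(icc[["Type", "ICC", "CI95%"]])
print(f"Pearson r = {r:.3f} (p = {p_r:.3g}); paired t = {t:.2f} (p = {p_t:.3g})")
```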
This study shows the potential of ChatGPT for essay marking. However, appropriate rubric design is essential for optimal reliability. With further validation, ChatGPT has the potential to aid students in self-assessment or to support large-scale automated marking.