Faculty of Dentistry, National University of Singapore, Singapore, Singapore.
Discipline of Oral and Maxillofacial Surgery, National University Centre for Oral Health, 9 Lower Kent Ridge Road, Singapore, Singapore.
BMC Med Educ. 2024 Sep 3;24(1):962. doi: 10.1186/s12909-024-05881-6.
This study aimed to answer the research question: how reliable is ChatGPT, compared with human assessors, in automated essay scoring (AES) of oral and maxillofacial surgery (OMS) examinations for dental undergraduate students?
Sixty-nine undergraduate dental students at the National University of Singapore sat a closed-book examination comprising two essay questions. Using pre-created assessment rubrics, three assessors independently performed manual essay scoring, while a separate assessor performed AES using ChatGPT (GPT-4). Data were analysed using the intraclass correlation coefficient and Cronbach's α to evaluate the reliability and inter-rater agreement of the test scores among all assessors. Mean manual and automated scores were compared for similarity and correlation.
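The abstract does not reproduce the authors' AES workflow or prompts; the following is a minimal sketch, assuming a programmatic route to GPT-4 through the OpenAI Python client, of how an essay might be scored against a rubric. The rubric text, prompt wording, and function name are illustrative only.

```python
# Minimal sketch (an assumption, not the authors' workflow): scoring one essay
# against a rubric with GPT-4 via the OpenAI Python client. The study used
# ChatGPT (GPT-4) directly; this programmatic version is illustrative only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

RUBRIC = """Criterion 1 (0-5): ...
Criterion 2 (0-5): ...
"""  # the actual OMS rubrics are not reproduced here

def score_essay(essay_text: str) -> str:
    """Ask GPT-4 to mark an essay against the rubric and return its response."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "You are an examiner. Score the essay strictly against "
                        "the rubric and report a mark per criterion plus a total."},
            {"role": "user", "content": f"Rubric:\n{RUBRIC}\n\nEssay:\n{essay_text}"},
        ],
        temperature=0,  # deterministic scoring for reproducibility
    )
    return response.choices[0].message.content

# Example: print(score_essay("Student answer to Question 1 ..."))
```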
A strong correlation was observed between AES and all manual scorers for Question 1 (r = 0.752-0.848, p < 0.001) and a moderate correlation for Question 2 (r = 0.527-0.571, p < 0.001). Intraclass correlation coefficients of 0.794-0.858 indicated excellent inter-rater agreement, and Cronbach's α of 0.881-0.932 indicated high reliability. For Question 1, mean AES scores were similar to manual scores (p > 0.05), and there was a strong correlation between AES and manual scores (r = 0.829, p < 0.001). For Question 2, AES scores were significantly lower than manual scores (p < 0.001), and there was a moderate correlation between AES and manual scores (r = 0.599, p < 0.001).
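For readers who want to reproduce this style of analysis, the sketch below shows one way to compute the reported statistics in Python. It is an assumed analysis, not the authors' code: the score matrix is a random placeholder, the pingouin package is assumed for the intraclass correlation, and a paired t-test stands in for whatever mean-comparison test the study actually used.

```python
# Minimal sketch (assumed analysis, not the authors' code) of the statistics
# reported above: Cronbach's alpha, intraclass correlation (via the pingouin
# package), Pearson correlation, and a paired comparison of mean scores.
# `scores` is a placeholder matrix of 69 students x 4 assessors (A1-A3 manual, AES).
import numpy as np
import pandas as pd
import pingouin as pg
from scipy import stats

rng = np.random.default_rng(0)
base = rng.normal(14, 3, size=(69, 1))
scores = base + rng.normal(0, 1.5, size=(69, 4))  # correlated placeholder ratings

# Cronbach's alpha: k/(k-1) * (1 - sum of per-rater variances / variance of total)
k = scores.shape[1]
alpha = k / (k - 1) * (1 - scores.var(axis=0, ddof=1).sum()
                       / scores.sum(axis=1).var(ddof=1))

# Intraclass correlation coefficient (long format required by pingouin)
df = pd.DataFrame(scores, columns=["A1", "A2", "A3", "AES"])
df["student"] = np.arange(len(df))
long = df.melt(id_vars="student", var_name="assessor", value_name="score")
icc = pg.intraclass_corr(data=long, targets="student", raters="assessor", ratings="score")

# AES versus the mean manual score: correlation and paired mean comparison
manual_mean = scores[:, :3].mean(axis=1)
r, p_r = stats.pearsonr(scores[:, 3], manual_mean)
t, p_t = stats.ttest_rel(scores[:, 3], manual_mean)

print(f"Cronbach's alpha = {alpha:.3f}")
print(icc[["Type", "ICC", "CI95%"]])
print(f"Pearson r = {r:.3f} (p = {p_r:.3g}); paired t = {t:.2f} (p = {p_t:.3g})")
```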
This study shows the potential of ChatGPT for essay marking. However, appropriate rubric design is essential for optimal reliability. With further validation, ChatGPT has the potential to aid students in self-assessment or to support large-scale automated marking.