Zhang Da-Wei, Boey Melissa, Tan Yan Yu, Jia Alexis Hoh Sheng
Department of Psychology, Jeffrey Cheah School of Medicine and Health Sciences, Monash University Malaysia, Bandar Sunway, 47500, Malaysia.
NPJ Sci Learn. 2024 Dec 30;9(1):79. doi: 10.1038/s41539-024-00291-1.
This study evaluates the ability of large language models (LLMs) to deliver criterion-based grading and examines the impact of prompt engineering with detailed criteria on grading performance. Using well-established human benchmarks and quantitative analyses, we found that even free LLMs achieve criterion-based grading when given a detailed understanding of the criteria, underscoring the importance of domain-specific understanding over model complexity. These findings highlight the potential of LLMs to deliver scalable educational feedback.