The Ohio State University, School of Teaching and Learning, Columbus, USA.
CBE Life Sci Educ. 2011 Winter;10(4):379-93. doi: 10.1187/cbe.11-08-0081.
Our study explored the prospects and limitations of using machine-learning software to score introductory biology students' written explanations of evolutionary change. We investigated three research questions: 1) Do scoring models built using student responses at one university function effectively at another university? 2) How many human-scored student responses are needed to build scoring models suitable for cross-institutional application? 3) What factors limit computer-scoring efficacy, and how can these factors be mitigated? To answer these questions, two biology experts scored a corpus of 2556 short-answer explanations (from biology majors and nonmajors) at two universities for the presence or absence of five key concepts of evolution. Human- and computer-generated scores were compared using kappa agreement statistics. We found that, in most cases, the machine-learning software accurately evaluated the degree of scientific sophistication in undergraduate majors' and nonmajors' written explanations of evolutionary change. In cases in which the software did not reach the benchmark of "near-perfect" agreement (kappa > 0.80), we located the causes of poor performance and identified a series of strategies to mitigate them. Machine-learning software holds promise as an assessment tool for undergraduate biology education, but like most assessment tools, it also has limitations.
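To make the agreement benchmark concrete, the sketch below illustrates how human and machine concept scores might be compared with Cohen's kappa against the kappa > 0.80 threshold cited above. It is a hypothetical example, not the study's analysis: the labels are invented, and the use of scikit-learn's cohen_kappa_score is an assumption about tooling rather than a description of the authors' software.

```python
# Hypothetical sketch: checking human-machine agreement with Cohen's kappa.
# The score vectors below are invented for illustration only.
from sklearn.metrics import cohen_kappa_score

# Presence (1) / absence (0) of one key evolutionary concept in each response,
# as judged by a human expert and by a machine-learning scoring model.
human_scores   = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
machine_scores = [1, 0, 1, 0, 0, 1, 0, 0, 1, 1]

kappa = cohen_kappa_score(human_scores, machine_scores)
print(f"kappa = {kappa:.2f}")

# Benchmark used in the abstract: "near-perfect" agreement at kappa > 0.80.
if kappa > 0.80:
    print("Agreement meets the near-perfect benchmark.")
else:
    print("Agreement falls below the benchmark; inspect misscored responses.")
```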