Centre for the Study of Existential Risk, University of Cambridge, 16 Mill Lane, Cambridge, CB2 1SB, UK.
Melbourne School of Psychological Sciences, University of Melbourne, Melbourne, Australia.
Behav Res Methods. 2024 Aug;56(5):4958-4973. doi: 10.3758/s13428-023-02234-x. Epub 2023 Oct 13.
In this paper, we investigate the criterion validity of forced-choice comparisons of the quality of written arguments with normative solutions. Across two studies, novices and experts assessing quality of reasoning through a forced-choice design were both able to choose arguments supporting more accurate solutions (62.2% of the time for novices, SE = 1%, and 74.4% for experts, SE = 1%) and arguments produced by larger teams (up to 82% of the time for novices and 85% for experts), with high inter-rater reliability: 70.58% agreement (95% CI = 1.18) for novices and 80.98% (95% CI = 2.26) for experts. We also explored two methods for increasing efficiency. We found that the number of comparative judgments needed could be substantially reduced with little accuracy loss by leveraging transitivity and producing quality-of-reasoning assessments using an AVL tree method. Moreover, a regression model trained to predict scores based on automatically derived linguistic features of participants' judgments achieved a high correlation with the objective accuracy scores of the arguments in our dataset. Despite the inherent subjectivity involved in evaluating differing quality of reasoning, the forced-choice paradigm allows even novice raters to perform beyond chance and can provide a valid, reliable, and efficient method for producing quality-of-reasoning assessments at scale.
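The abstract does not spell out the AVL tree procedure, so the following is only a minimal sketch of the general idea, assuming judgments are consistent (transitive). Each argument is inserted into a self-balancing binary search tree, and every step down the tree costs one forced-choice judgment, so a full ranking needs on the order of n log n judgments rather than all n(n-1)/2 pairs. All names here (`Node`, `insert`, `rank_by_forced_choice`, `demo`, and the simulated noiseless rater) are hypothetical and not taken from the paper.

```python
import random


class Node:
    def __init__(self, item):
        self.item = item
        self.left = None
        self.right = None
        self.height = 1


def _height(node):
    return node.height if node else 0


def _update(node):
    node.height = 1 + max(_height(node.left), _height(node.right))


def _rotate_right(y):
    x = y.left
    y.left, x.right = x.right, y
    _update(y)
    _update(x)
    return x


def _rotate_left(x):
    y = x.right
    x.right, y.left = y.left, x
    _update(x)
    _update(y)
    return y


def _rebalance(node):
    # Standard AVL rebalancing: restore |balance| <= 1 with rotations.
    _update(node)
    balance = _height(node.left) - _height(node.right)
    if balance > 1:
        if _height(node.left.left) < _height(node.left.right):
            node.left = _rotate_left(node.left)
        return _rotate_right(node)
    if balance < -1:
        if _height(node.right.right) < _height(node.right.left):
            node.right = _rotate_right(node.right)
        return _rotate_left(node)
    return node


def insert(node, item, prefer):
    """Insert item; prefer(a, b) is True when a is judged better than b."""
    if node is None:
        return Node(item)
    # One forced-choice judgment per level visited.
    if prefer(item, node.item):
        node.right = insert(node.right, item, prefer)
    else:
        node.left = insert(node.left, item, prefer)
    return _rebalance(node)


def in_order(node):
    if node is None:
        return []
    return in_order(node.left) + [node.item] + in_order(node.right)


def rank_by_forced_choice(items, prefer):
    """Return items ordered from worst to best according to prefer."""
    root = None
    for item in items:
        root = insert(root, item, prefer)
    return in_order(root)


def demo(n=32, seed=0):
    # Assumption: a simulated rater who always prefers the higher hidden score.
    rng = random.Random(seed)
    quality = [rng.random() for _ in range(n)]
    calls = 0

    def prefer(a, b):
        nonlocal calls
        calls += 1
        return quality[a] > quality[b]

    ranking = rank_by_forced_choice(range(n), prefer)
    return ranking, quality, calls, n * (n - 1) // 2
```

Calling `demo()` returns the recovered ranking, the hidden scores, the number of judgments actually requested, and the number a full pairwise design would need. With real, noisy raters the ordering would only approximate the true one, which is consistent with the abstract reporting "little accuracy loss" rather than none.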