Department of Psychology, Queen's University, Kingston, ON, Canada.
CloudResearch, Queens, NY, USA.
Behav Res Methods. 2023 Dec;55(8):3953-3964. doi: 10.3758/s13428-022-01999-x. Epub 2022 Nov 3.
Maintaining data quality on Amazon Mechanical Turk (MTurk) has always been a concern for researchers. These concerns have grown recently due to the bot crisis of 2018 and observations that past safeguards of data quality (e.g., approval ratings of 95%) no longer work. To address data quality concerns, CloudResearch, a third-party website that interfaces with MTurk, has assessed 165,000 MTurkers and categorized them into those who provide high-quality (100,000; Approved) and low-quality (~65,000; Blocked) data. Here, we examined the predictive validity of CloudResearch's vetting. In a pre-registered study, participants (N = 900) from the Approved and Blocked groups, along with a Standard MTurk sample (95% HIT approval rating, 100+ completed HITs), completed an array of data-quality measures. Across several indices, Approved participants (i) identified the content of images more accurately, (ii) answered more reading comprehension questions correctly, (iii) responded to reverse-coded items more consistently, (iv) passed a greater number of attention checks, (v) self-reported less cheating and actually left the survey window less often on easily Googleable questions, (vi) replicated classic psychology experimental effects more reliably, and (vii) answered AI-stumping questions more accurately than Blocked participants, who performed at chance on multiple outcomes. Data quality of the Standard sample generally fell between that of the Approved and Blocked groups. We discuss how MTurk's approval rating system is no longer an effective data-quality control and the advantages of using the Approved group for scientific studies on MTurk.