Department of Psychology, Paris-Lodron-University of Salzburg, Hellbrunner Straße 34, 5020, Salzburg, Austria.
Centre for Cognitive Neuroscience, Paris-Lodron-University of Salzburg, Salzburg, Austria.
Behav Res Methods. 2024 Mar;56(3):1551-1582. doi: 10.3758/s13428-023-02109-1. Epub 2023 May 23.
Reaction time (RT) data are often pre-processed before analysis by rejecting outliers and errors and aggregating the data. In stimulus-response compatibility paradigms such as the approach-avoidance task (AAT), researchers often decide how to pre-process the data without an empirical basis, leading to the use of methods that may harm data quality. To provide this empirical basis, we investigated how different pre-processing methods affect the reliability and validity of the AAT. Our literature review revealed 108 unique pre-processing pipelines among 163 examined studies. Using empirical datasets, we found that validity and reliability were negatively affected by retaining error trials, by replacing error RTs with the mean RT plus a penalty, and by retaining outliers. In the relevant-feature AAT, bias scores were more reliable and valid if computed with D-scores; medians were less reliable and more unpredictable, while means were also less valid. Simulations revealed that bias scores were likely to be less accurate if computed by contrasting a single aggregate of all compatible conditions with that of all incompatible conditions, rather than by contrasting separate averages per condition. We also found that multilevel model random effects were less reliable, valid, and stable, arguing against their use as bias scores. We call upon the field to drop these suboptimal practices to improve the psychometric properties of the AAT. We also call for similar investigations in related RT-based bias measures such as the implicit association test, as their commonly accepted pre-processing practices involve many of the aforementioned discouraged methods.
HIGHLIGHTS:
• Rejecting RTs deviating more than 2 or 3 SD from the mean gives more reliable and valid results than other outlier rejection methods in empirical data
• Removing error trials gives more reliable and valid results than retaining them or replacing them with the block mean plus an added penalty
• Double-difference scores are more reliable than compatibility scores under most circumstances
• More reliable and valid results are obtained in both simulated and real data by using double-difference D-scores, which are obtained by dividing a participant's double mean difference score by the SD of their RTs
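As a minimal sketch (not the authors' code, and with all function and column names chosen for illustration), the recommended pipeline for one participant — drop error trials, reject RTs beyond a 2–3 SD cutoff, average each movement × stimulus-category cell separately, and divide the double mean difference by the SD of the retained RTs — could look like:

```python
import numpy as np

def aat_d_score(rt, move, category, error, sd_cutoff=2.5):
    """Double-difference D-score for one participant's AAT data.

    rt       : reaction times for all trials
    move     : "approach" / "avoid" label per trial
    category : "target" / "control" stimulus label per trial
    error    : True for error trials
    """
    rt = np.asarray(rt, dtype=float)
    move, category = np.asarray(move), np.asarray(category)
    error = np.asarray(error, dtype=bool)

    # 1. Remove error trials (retaining them, or replacing them with a
    #    mean-plus-penalty, reduced reliability and validity).
    keep = ~error

    # 2. Reject outliers beyond sd_cutoff SDs of the participant's mean RT.
    m, s = rt[keep].mean(), rt[keep].std(ddof=1)
    keep &= np.abs(rt - m) <= sd_cutoff * s

    # 3. Mean RT per movement x category cell — separate averages per
    #    condition, not one aggregate per compatibility level.
    def cell(mv, cat):
        return rt[keep & (move == mv) & (category == cat)].mean()

    # 4. Double mean difference: approach bias toward the target category
    #    relative to the control category.
    double_diff = (cell("avoid", "target") - cell("approach", "target")) \
                - (cell("avoid", "control") - cell("approach", "control"))

    # 5. D-score: divide by the SD of the participant's retained RTs.
    return double_diff / rt[keep].std(ddof=1)
```

A positive score under this sign convention indicates that the participant approached target stimuli faster than they avoided them, relative to the control category; dividing by the individual SD puts scores from slow and fast responders on a comparable scale.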