Roberta Rocca, Tal Yarkoni
Department of Psychology, University of Texas at Austin, Austin, Texas, USA.
Interacting Minds Centre, Aarhus University, Aarhus, Denmark.
Adv Methods Pract Psychol Sci. 2021 Jul-Sep;4(3). doi: 10.1177/25152459211026864. Epub 2021 Sep 23.
Consensus on standards for evaluating models and theories is an integral part of every science. Nonetheless, in psychology, relatively little focus has been placed on defining reliable communal metrics to assess model performance. Evaluation practices are often idiosyncratic and are affected by a number of shortcomings (e.g., failure to assess models' ability to generalize to unseen data) that make it difficult to discriminate between good and bad models. Drawing inspiration from fields such as machine learning and statistical genetics, we argue in favor of introducing common benchmarks as a means of overcoming the lack of reliable model evaluation criteria currently observed in psychology. We discuss a number of principles benchmarks should satisfy to achieve maximal utility, identify concrete steps the community could take to promote the development of such benchmarks, and address a number of potential pitfalls and concerns that may arise in the course of implementation. We argue that reaching consensus on common evaluation benchmarks will foster cumulative progress in psychology and encourage researchers to place heavier emphasis on the practical utility of scientific models.