Institute for Quantitative Social Science, Harvard University, Cambridge, MA, USA.
CAS Key Laboratory of Forest Ecology and Management, Institute of Applied Ecology, Chinese Academy of Sciences, Shenyang, China.
Sci Data. 2022 Feb 21;9(1):60. doi: 10.1038/s41597-022-01143-6.
This article presents a study on the quality and execution of research code from publicly available replication datasets at the Harvard Dataverse repository. Research code is typically created by a group of scientists and published together with academic papers to facilitate research transparency and reproducibility. For this study, we define ten questions to address aspects impacting research reproducibility and reuse. First, we retrieve and analyze more than 2000 replication datasets with over 9000 unique R files published from 2010 to 2020. Second, we execute the code in a clean runtime environment to assess its ease of reuse. We identify common coding errors and resolve some of them with automatic code cleaning to aid code execution. We find that 74% of R files failed to complete without error in the initial execution, while 56% failed when code cleaning was applied, showing that many errors can be prevented with good coding practices. We also analyze the replication datasets from journals' collections and discuss the impact of the journal policy strictness on the code re-execution rate. Finally, based on our results, we propose a set of recommendations for code dissemination aimed at researchers, journals, and repositories.
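The re-execution step described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names, the outcome labels, and the 60-second timeout are hypothetical choices made for illustration, and it assumes an `Rscript` executable is available on the PATH. It isolates each file only at the process level; building a truly clean runtime environment (for example a fresh container) is outside its scope.

```python
import subprocess
from pathlib import Path


def run_script(path, interpreter=("Rscript",), timeout=60):
    """Run one script in its own process and classify the outcome.

    `interpreter` and `timeout` are illustrative defaults, not values
    taken from the study.
    """
    path = Path(path)
    try:
        proc = subprocess.run(
            [*interpreter, path.name],
            cwd=path.parent,        # run from the script's own directory
            capture_output=True,    # keep stdout/stderr off the console
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return "timeout"
    except FileNotFoundError:
        # The interpreter (or the working directory) could not be found.
        return "not_run"
    return "success" if proc.returncode == 0 else "error"


def failure_rate(outcomes):
    """Share of scripts that did not finish with exit code 0."""
    if not outcomes:
        return 0.0
    failed = sum(1 for outcome in outcomes if outcome != "success")
    return failed / len(outcomes)


def summarize(directory, pattern="*.R", **kwargs):
    """Run every matching file under `directory` and return the failure rate."""
    files = sorted(Path(directory).rglob(pattern))
    return failure_rate([run_script(f, **kwargs) for f in files])
```

Calling `summarize("path/to/dataset")` on an unpacked replication dataset would give a figure comparable in spirit to the failure percentages reported above, although the exact numbers depend on the R version, the installed packages, and the time limit.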