Koesten Laura, Vougiouklis Pavlos, Simperl Elena, Groth Paul
King's College London, London WC2B 4BG, UK.
Huawei Technologies, Edinburgh EH9 3BF, UK.
Patterns (N Y). 2020 Nov 4;1(8):100136. doi: 10.1016/j.patter.2020.100136. eCollection 2020 Nov 13.
The web provides access to millions of datasets that can have additional impact when used beyond their original context. We have little empirical insight into what makes one dataset more reusable than another and which of the existing guidelines and frameworks, if any, make a difference. In this paper, we explore potential reuse features through a literature review and present a case study on datasets on GitHub, a popular open platform for sharing code and data. We describe a corpus of more than 1.4 million data files from over 65,000 repositories. Using GitHub's engagement metrics as proxies for dataset reuse, we relate them to reuse features from the literature and devise an initial model, using deep neural networks, to predict a dataset's reusability. This work demonstrates the practical gap between principles and actionable insights that allow data publishers and tool designers to implement functionalities that provably facilitate reuse.