Statistical Cybermetrics Research Group, University of Wolverhampton, Wolverhampton, United Kingdom.
MRC Integrative Epidemiology Unit at the University of Bristol, Bristol, United Kingdom.
PLoS One. 2020 Feb 21;15(2):e0229578. doi: 10.1371/journal.pone.0229578. eCollection 2020.
Primary data collected during a research study is often shared and may be reused for new studies. To assess the extent of data sharing in favourable circumstances and whether data sharing checks can be automated, this article investigates summary statistics from primary human genome-wide association studies (GWAS). This type of data is highly suitable for sharing because it is a standard research output, is straightforward to use in future studies (e.g., for secondary analysis), and may be already stored in a standard format for internal sharing within multi-site research projects. Manual checks of 1799 articles from 2010 and 2017 matching a simple PubMed query for molecular epidemiology GWAS were used to identify 314 primary human GWAS papers. Of these, only 13% reported the location of a complete set of GWAS summary data, increasing from 3% in 2010 to 23% in 2017. Whilst information about whether data was shared was typically located clearly within a data availability statement, the exact nature of the shared data was usually unspecified. Thus, data sharing is the exception even in suitable research fields with relatively strong data sharing norms. Moreover, the lack of clear data descriptions within data sharing statements greatly complicates the task of automatically characterising shared data sets.