National Evolutionary Synthesis Center, Durham, NC, USA; Department of Biology, Duke University, Durham, NC, USA.
PeerJ. 2013 Oct 1;1:e175. doi: 10.7717/peerj.175. eCollection 2013.
Background. Attribution to the original contributor upon reuse of published data is important both as a reward for data creators and to document the provenance of research findings. Previous studies have found that papers with publicly available datasets receive a higher number of citations than similar studies without available data. However, few previous analyses have had the statistical power to control for the many variables known to predict citation rate, which has led to uncertain estimates of the "citation benefit". Furthermore, little is known about patterns in data reuse over time and across datasets.
Method and Results. Here, we look at citation rates while controlling for many known citation predictors and investigate the variability of data reuse. In a multivariate regression on 10,555 studies that created gene expression microarray data, we found that studies that made data available in a public repository received 9% (95% confidence interval: 5% to 13%) more citations than similar studies for which the data was not made available. Date of publication, journal impact factor, open access status, number of authors, first and last author publication history, corresponding author country, institution citation history, and study topic were included as covariates. The citation benefit varied with date of dataset deposition: a citation benefit was most clear for papers published in 2004 and 2005, at about 30%. Authors published most papers using their own datasets within two years of their first publication on the dataset, whereas data reuse papers published by third-party investigators continued to accumulate for at least six years. To study patterns of data reuse directly, we compiled 9,724 instances of third-party data reuse via mention of GEO or ArrayExpress accession numbers in the full text of papers. The level of third-party data use was high: for 100 datasets deposited in year 0, we estimated that 40 papers in PubMed reused a dataset by year 2, 100 by year 4, and more than 150 data reuse papers had been published by year 5. Data reuse was distributed across a broad base of datasets: a very conservative estimate found that 20% of the datasets deposited between 2003 and 2007 had been reused at least once by third parties.
Conclusion. After accounting for other factors affecting citation rate, we find a robust citation benefit from open data, although a smaller one than previously reported. We conclude there is a direct effect of third-party data reuse that persists for years beyond the time when researchers have published most of the papers reusing their own data. Other factors that may also contribute to the citation benefit are considered. We further conclude that, at least for gene expression microarray data, a substantial fraction of archived datasets are reused, and that the intensity of dataset reuse has been steadily increasing since 2003.
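The abstract reports the citation benefit as a percentage with a confidence interval. As a minimal sketch of how such a figure can relate to a regression coefficient, the snippet below assumes the outcome was modeled on a natural-log scale, which the abstract does not state. Under that assumption, a coefficient `beta` on the data-availability indicator corresponds to a percent change of `(exp(beta) - 1) * 100`. The function names are hypothetical, and the only numbers used are the 9% estimate and the 5% to 13% interval quoted above.

```python
import math


def percent_change(beta: float) -> float:
    """Percent change in the outcome implied by a coefficient on a log scale."""
    return (math.exp(beta) - 1.0) * 100.0


def log_coefficient(percent: float) -> float:
    """Inverse mapping: the log-scale coefficient implied by a percent change."""
    return math.log(1.0 + percent / 100.0)


# Values quoted in the abstract: 9% (95% confidence interval: 5% to 13%).
# Assumption: these come from a model of log-transformed citation counts.
reported = {"estimate": 9.0, "ci_low": 5.0, "ci_high": 13.0}

# Log-scale coefficients that would correspond to the reported percentages.
implied_betas = {name: log_coefficient(pct) for name, pct in reported.items()}

for name, beta in implied_betas.items():
    print(f"{name}: beta = {beta:.4f}, percent change = {percent_change(beta):.1f}%")
```

Because the mapping is monotonic, the confidence bounds can be converted endpoint by endpoint, which is why an interval that is symmetric on the log scale looks slightly asymmetric in percent terms.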