Caltech Library, California Institute of Technology, Pasadena, CA, United States of America.
PLoS One. 2024 Jun 5;19(6):e0304781. doi: 10.1371/journal.pone.0304781. eCollection 2024.
To determine where data is shared and what data is no longer available, this study analyzed data shared by researchers at a single university. A total of 2166 supplemental data links were harvested from the university's institutional repository and web scraped using R. All links that failed to scrape or could not be tested algorithmically were tested for availability by hand. Data availability was then examined for trends by link type, age of publication, and data source. Results show that researchers shared data in hundreds of places. About two-thirds of links to shared data were in the form of URLs and one-third were DOIs, with several FTP links and links directly to files. A surprising 13.4% of shared URL links pointed to a website homepage rather than a specific record on a website. After testing, 5.4% of the 2166 supplemental data links were found to be no longer available. DOIs were the type of shared link least likely to disappear, with a 1.7% loss, compared with a URL loss of 5.9% averaged over time. Links from older publications, as well as links to data hosted on journal websites, were more likely to be unavailable; the data disappearance rate was estimated at 2.6% per year. The results support best practice guidance to share data in a data repository using a permanent identifier.
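The link-type breakdown described above (DOIs, URLs, FTP links, direct file links, and URLs that point only to a homepage) can be illustrated with a minimal sketch. The study itself used R, and its actual classification rules are not given in this abstract; the Python below is an assumption-laden illustration only. The function names `classify_link` and `summarize`, the category labels, and the `FILE_EXTENSIONS` list are all hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical list of extensions treated as "direct file" links (an assumption).
FILE_EXTENSIONS = (".csv", ".zip", ".xlsx", ".pdf", ".txt", ".gz")


def classify_link(link: str) -> str:
    """Assign a link to one illustrative category.

    This is not the study's coding scheme, only a plausible approximation.
    """
    link = link.strip()
    parsed = urlparse(link)
    host = parsed.netloc.lower()
    path = parsed.path

    # DOIs written as "doi:10.xxxx/..." or resolved through doi.org.
    if link.lower().startswith("doi:") or host in ("doi.org", "dx.doi.org"):
        return "doi"
    if parsed.scheme == "ftp":
        return "ftp"
    # Links that end in a known file extension point directly at a file.
    if path.lower().endswith(FILE_EXTENSIONS):
        return "file"
    # No path and no query string: the link targets a site's homepage.
    if path in ("", "/") and not parsed.query:
        return "homepage"
    return "url"


def summarize(links):
    """Return the percentage share of each category, rounded to one decimal."""
    counts = Counter(classify_link(link) for link in links)
    total = sum(counts.values())
    return {kind: round(100 * n / total, 1) for kind, n in counts.items()}
```

Checking whether each link still resolves would be a separate step (an HTTP request per link, with manual follow-up for failures, as the abstract describes) and is deliberately left out of this sketch.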