McMurry Julie A, Juty Nick, Blomberg Niklas, Burdett Tony, Conlin Tom, Conte Nathalie, Courtot Mélanie, Deck John, Dumontier Michel, Fellows Donal K, Gonzalez-Beltran Alejandra, Gormanns Philipp, Grethe Jeffrey, Hastings Janna, Hériché Jean-Karim, Hermjakob Henning, Ison Jon C, Jimenez Rafael C, Jupp Simon, Kunze John, Laibe Camille, Le Novère Nicolas, Malone James, Martin Maria Jesus, McEntyre Johanna R, Morris Chris, Muilu Juha, Müller Wolfgang, Rocca-Serra Philippe, Sansone Susanna-Assunta, Sariyar Murat, Snoep Jacky L, Soiland-Reyes Stian, Stanford Natalie J, Swainston Neil, Washington Nicole, Williams Alan R, Wimalaratne Sarala M, Winfree Lilly M, Wolstencroft Katherine, Goble Carole, Mungall Christopher J, Haendel Melissa A, Parkinson Helen
Department of Medical Informatics and Epidemiology and OHSU Library, Oregon Health & Science University, Portland, Oregon, United States of America.
European Bioinformatics Institute, European Molecular Biology Laboratory, Wellcome Genome Campus, Hinxton, Cambridge, United Kingdom.
PLoS Biol. 2017 Jun 29;15(6):e2001414. doi: 10.1371/journal.pbio.2001414. eCollection 2017 Jun.
In many disciplines, data are highly decentralized across thousands of online databases (repositories, registries, and knowledgebases). Wringing value from such databases depends on the discipline of data science and on the humble bricks and mortar that make integration possible; identifiers are a core component of this integration infrastructure. Drawing on our experience and on work by other groups, we outline 10 lessons we have learned about the identifier qualities and best practices that facilitate large-scale data integration. Specifically, we propose actions that identifier practitioners (database providers) should take in the design, provision, and reuse of identifiers. We also outline the important considerations for those who reference identifiers in various circumstances, including authors and data generators. While the importance and relevance of each lesson will vary by context, there is a need for increased awareness about how to avoid and manage common identifier problems, especially those related to persistence and web-accessibility/resolvability. We focus strongly on web-based identifiers in the life sciences; however, the principles are broadly relevant to other disciplines.