Navale Vivek, McAuliffe Matthew
Center for Information Technology, National Institutes of Health, Bethesda, Maryland, 20892, USA.
F1000Res. 2018 Aug 29;7:1353. doi: 10.12688/f1000research.16015.1. eCollection 2018.
Genomics and molecular imaging, along with clinical and translational research, have transformed biomedical science into a data-intensive endeavor. For researchers to benefit from Big Data sets, developing a long-term strategy for preserving biomedical digital data is essential. In this opinion article, we discuss specific actions that researchers and institutions can take to keep research data available as a resource even after research projects have reached the end of their lifecycle. The actions involve using an Open Archival Information System (OAIS) model comprising six functional entities: Ingest, Access, Data Management, Archival Storage, Administration and Preservation Planning. We believe that involving data stewards early in the digital data lifecycle management process can contribute significantly to the long-term preservation of biomedical data. Developing data collection strategies consistent with institutional policies, and encouraging the use of common data elements in clinical research, patient registries and other human subject research, can be advantageous for data sharing and integration. Specifically, at the onset of a research program, data stewards should engage with established repositories and curators to develop data sustainability plans for research data. Placing equal importance on the requirements for initial activities (e.g., collection, processing, storage) and subsequent activities (e.g., data analysis, sharing) can improve data quality, provide traceability and support reproducibility. Preparing and tracking data provenance and using common data elements and biomedical ontologies are important for standardizing data descriptions, which makes data easier to interpret and reuse. The Big Data biomedical community requires a scalable platform that can support the diversity and complexity of data ingest modes (e.g., machine, software or human entry modes). Secure virtual workspaces for integrating and manipulating data, with shared software programs (e.g., bioinformatics tools), can facilitate the FAIR (Findable, Accessible, Interoperable and Reusable) use of data for near- and long-term research needs.