
Lightweight data management with dtool.

Authors

Olsson Tjelvar S G, Hartley Matthew

Affiliation

Computational Systems Biology, John Innes Centre, Norwich, United Kingdom.

Publication information

PeerJ. 2019 Mar 7;7:e6562. doi: 10.7717/peerj.6562. eCollection 2019.

DOI:10.7717/peerj.6562
PMID:30867992
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6409086/
Abstract

The explosion in volumes and types of data has led to substantial challenges in data management. These challenges are often faced by front-line researchers who are already dealing with rapidly changing technologies and have limited time to devote to data management. There are good high-level guidelines for managing and processing scientific data. However, there is a lack of simple, practical tools to implement these guidelines. This is particularly problematic in a highly distributed research environment where needs differ substantially from group to group and centralised solutions are difficult to implement and storage technologies change rapidly. To meet these challenges we have developed dtool, a command line tool for managing data. The tool packages data and metadata into a unified whole, which we call a dataset. The dataset provides consistency checking and the ability to access metadata for both the whole dataset and individual files. The tool can store these datasets on several different storage systems, including a traditional file system, object store (S3 and Azure) and iRODS. It includes an application programming interface that can be used to incorporate it into existing pipelines and workflows. The tool has provided substantial process, cost, and peace-of-mind benefits to our data management practices and we want to share these benefits. The tool is open source and available freely online at http://dtool.readthedocs.io.
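
The idea of packaging data and metadata into one unit with consistency checking can be illustrated with a minimal sketch. This is not dtool's implementation or API: the directory layout (`data/` plus `manifest.json`), the function names `freeze` and `verify`, and the choice of SHA-256 are all assumptions made for illustration, using only the Python standard library.

```python
import hashlib
import json
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file's content."""
    h = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def list_items(data_dir):
    """Return the sorted relative paths of all files under data_dir."""
    return sorted(
        p.relative_to(data_dir).as_posix()
        for p in data_dir.rglob("*")
        if p.is_file()
    )


def freeze(dataset_dir, description):
    """Record dataset-level and per-file metadata in manifest.json.

    Hypothetical layout: files live in <dataset_dir>/data and the
    manifest is written next to that directory.
    """
    data_dir = dataset_dir / "data"
    items = {}
    for relpath in list_items(data_dir):
        path = data_dir / relpath
        items[relpath] = {
            "size": path.stat().st_size,
            "sha256": file_digest(path),
        }
    manifest = {"description": description, "items": items}
    (dataset_dir / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return manifest


def verify(dataset_dir):
    """Return a list of inconsistencies; an empty list means none found."""
    manifest = json.loads((dataset_dir / "manifest.json").read_text())
    data_dir = dataset_dir / "data"
    found = set(list_items(data_dir))
    problems = []
    for relpath, props in manifest["items"].items():
        if relpath not in found:
            problems.append(f"missing: {relpath}")
        elif file_digest(data_dir / relpath) != props["sha256"]:
            problems.append(f"altered: {relpath}")
    for relpath in sorted(found - set(manifest["items"])):
        problems.append(f"unexpected: {relpath}")
    return problems
```

Calling `freeze(Path("my_dataset"), {"project": "demo"})` once the data is in place, and `verify(Path("my_dataset"))` later, reports files that have gone missing, changed, or been added since the manifest was written. The real tool additionally abstracts over storage back ends, which this sketch does not attempt.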

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f9bb/6409086/03e7ed5dd423/peerj-07-6562-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f9bb/6409086/2ef73ec8db26/peerj-07-6562-g001.jpg
