Department of Biostatistics and Bioinformatics, Roswell Park Comprehensive Cancer Center, Buffalo, NY, 14263, USA.
BMC Bioinformatics. 2024 Jan 3;25(1):8. doi: 10.1186/s12859-023-05626-0.
The increasing volume and complexity of genomic data pose significant challenges for effective data management and reuse. Public genomic data often undergo similar preprocessing across projects, leading to redundant or inconsistent datasets and inefficient use of computing resources. This is especially pertinent for bioinformaticians engaged in multiple projects. Tools have been created to address challenges in managing and accessing curated genomic datasets; however, such tools are most useful to users who work with specific types of data or who prefer a particular programming language. Currently, an R-specific solution for efficient data management and versatile data reuse is lacking.
Here we present ReUseData, an R software tool that overcomes some of the limitations of existing solutions and provides a versatile and reproducible approach to effective data management within R. ReUseData facilitates the transformation of ad hoc data preprocessing scripts into Common Workflow Language (CWL)-based data recipes, allowing for the reproducible generation of curated data files in their generic formats. The data recipes are standardized and self-contained, making them easily portable and reproducible across various computing platforms. ReUseData also streamlines the reuse of curated data files and their integration into downstream analysis tools and workflows built on different frameworks.
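The recipe-and-reuse idea described above can be illustrated with a minimal Python sketch. It is not ReUseData's API and does not use CWL: the names `run_recipe`, `search_data`, and `toy_script`, as well as the JSON metadata layout, are hypothetical and chosen only for illustration. The sketch assumes that a recipe is a named preprocessing step with parameters, that each output is stored with a small metadata record, and that a repeated request with the same parameters should reuse the existing file rather than regenerate it.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def run_recipe(name, script, params, outdir):
    """Run a preprocessing step once and record metadata next to its output."""
    outdir = Path(outdir)
    outdir.mkdir(parents=True, exist_ok=True)
    # Key the output on recipe name plus parameters so identical requests match.
    digest = hashlib.sha256(
        json.dumps([name, params], sort_keys=True).encode()
    ).hexdigest()[:12]
    meta_path = outdir / f"{name}_{digest}.meta.json"
    if meta_path.exists():
        # Already generated: reuse the curated file instead of recomputing.
        return json.loads(meta_path.read_text())
    output = outdir / f"{name}_{digest}.txt"
    output.write_text(script(**params))
    meta = {
        "recipe": name,
        "params": params,
        "output": str(output),
        "keywords": [name, *(str(v) for v in params.values())],
    }
    meta_path.write_text(json.dumps(meta, indent=2))
    return meta


def search_data(outdir, keyword):
    """Return metadata records whose keyword list contains the search term."""
    paths = sorted(Path(outdir).glob("*.meta.json"))
    records = (json.loads(p.read_text()) for p in paths)
    return [r for r in records if keyword in r["keywords"]]


# Hypothetical stand-in for an ad hoc preprocessing script.
calls = []


def toy_script(species, version):
    calls.append((species, version))
    return f"reference for {species} {version}\n"


with tempfile.TemporaryDirectory() as tmp:
    params = {"species": "human", "version": "v42"}
    first = run_recipe("toy_reference", toy_script, params, tmp)
    second = run_recipe("toy_reference", toy_script, params, tmp)  # reused
    hits = search_data(tmp, "human")
```

In this sketch the second call returns the stored metadata without running the script again, and the keyword search locates the curated file for a downstream step. ReUseData itself expresses recipes in CWL and manages them from within R.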
ReUseData provides a reliable and reproducible approach for genomic data management within the R environment to enhance the accessibility and reusability of genomic data. The package is available at Bioconductor (https://bioconductor.org/packages/ReUseData/) with additional information on the project website (https://rcwl.org/dataRecipes/).