Ma Li, Peterson Erich A, Shin Ik Jae, Muesse Jason, Marino Katy, Steliga Matthew A, Johann Donald J
Winthrop P. Rockefeller Cancer Institute, University of Arkansas for Medical Sciences, Little Rock, AR, United States.
Department of Information Science, University of Arkansas at Little Rock, Little Rock, AR, United States.
Front Big Data. 2021 Sep 27;4:725095. doi: 10.3389/fdata.2021.725095. eCollection 2021.
Accuracy and reproducibility are vital in science and presents a significant challenge in the emerging discipline of data science, especially when the data are scientifically complex and massive in size. Further complicating matters, in the field of genomic-based science high-throughput sequencing technologies generate considerable amounts of data that needs to be stored, manipulated, and analyzed using a plethora of software tools. Researchers are rarely able to reproduce published genomic studies. Presented is a novel approach which facilitates accuracy and reproducibility for large genomic research data sets. All data needed is loaded into a portable local database, which serves as an interface for well-known software frameworks. These include python-based Jupyter Notebooks and the use of RStudio projects and R markdown. All software is encapsulated using Docker containers and managed by Git, simplifying software configuration management. Accuracy and reproducibility in science is of a paramount importance. For the biomedical sciences, advances in high throughput technologies, molecular biology and quantitative methods are providing unprecedented insights into disease mechanisms. With these insights come the associated challenge of scientific data that is complex and massive in size. This makes collaboration, verification, validation, and reproducibility of findings difficult. To address these challenges the NGS post-pipeline accuracy and reproducibility system (NPARS) was developed. NPARS is a robust software infrastructure and methodology that can encapsulate data, code, and reporting for large genomic studies. This paper demonstrates the successful use of NPARS on large and complex genomic data sets across different computational platforms.
准确性和可重复性在科学中至关重要,并且在新兴的数据科学学科中构成了重大挑战,尤其是当数据在科学上复杂且规模庞大时。更复杂的是,在基于基因组的科学领域,高通量测序技术会生成大量数据,需要使用大量软件工具进行存储、处理和分析。研究人员很少能够重现已发表的基因组研究。本文提出了一种新颖的方法,可促进大型基因组研究数据集的准确性和可重复性。所需的所有数据都被加载到一个便携式本地数据库中,该数据库作为知名软件框架的接口。这些包括基于Python的Jupyter Notebook以及RStudio项目和R markdown的使用。所有软件都使用Docker容器进行封装,并由Git进行管理,从而简化了软件配置管理。科学中的准确性和可重复性至关重要。对于生物医学科学而言,高通量技术、分子生物学和定量方法的进步正在为疾病机制提供前所未有的见解。伴随着这些见解而来的是科学数据复杂且规模庞大的相关挑战。这使得研究结果的协作、验证、确认和可重复性变得困难。为应对这些挑战,开发了NGS后管道准确性和可重复性系统(NPARS)。NPARS是一种强大的软件基础设施和方法,可封装大型基因组研究的数据、代码和报告。本文展示了NPARS在不同计算平台上对大型复杂基因组数据集的成功应用。