
NPARS-A Novel Approach to Address Accuracy and Reproducibility in Genomic Data Science.

Authors

Ma Li, Peterson Erich A, Shin Ik Jae, Muesse Jason, Marino Katy, Steliga Matthew A, Johann Donald J

Affiliations

Winthrop P. Rockefeller Cancer Institute, University of Arkansas for Medical Sciences, Little Rock, AR, United States.

Department of Information Science, University of Arkansas at Little Rock, Little Rock, AR, United States.

Publication information

Front Big Data. 2021 Sep 27;4:725095. doi: 10.3389/fdata.2021.725095. eCollection 2021.

DOI:10.3389/fdata.2021.725095
PMID:34647017
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8503682/
Abstract

Accuracy and reproducibility are vital in science and present a significant challenge in the emerging discipline of data science, especially when the data are scientifically complex and massive in size. Further complicating matters, in genomic-based science, high-throughput sequencing technologies generate considerable amounts of data that need to be stored, manipulated, and analyzed using a plethora of software tools. Researchers are rarely able to reproduce published genomic studies. Presented is a novel approach that facilitates accuracy and reproducibility for large genomic research data sets. All data needed are loaded into a portable local database, which serves as an interface for well-known software frameworks. These include Python-based Jupyter Notebooks, RStudio projects, and R Markdown. All software is encapsulated using Docker containers and managed by Git, simplifying software configuration management.

Accuracy and reproducibility in science are of paramount importance. For the biomedical sciences, advances in high-throughput technologies, molecular biology, and quantitative methods are providing unprecedented insights into disease mechanisms. With these insights comes the associated challenge of scientific data that are complex and massive in size. This makes collaboration, verification, validation, and reproducibility of findings difficult. To address these challenges, the NGS post-pipeline accuracy and reproducibility system (NPARS) was developed. NPARS is a robust software infrastructure and methodology that can encapsulate data, code, and reporting for large genomic studies. This paper demonstrates the successful use of NPARS on large and complex genomic data sets across different computational platforms.
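The abstract does not name the database engine, schema, or any NPARS function, so the following is only a minimal sketch of the general idea of a portable local database serving as the single interface for analysis code. It assumes SQLite (via Python's standard `sqlite3` module) as a stand-in; the `variants` table, the sample records, and the helper names `build_database`, `variants_per_gene`, and `result_fingerprint` are all hypothetical. Hashing the query result is one simple way a collaborator on another platform could confirm that they reproduced the same output.

```python
import hashlib
import sqlite3

# Hypothetical post-pipeline records: (sample_id, gene, variant, allele_fraction).
# These values are illustrative only and do not come from the paper.
RECORDS = [
    ("S1", "TP53", "p.R175H", 0.41),
    ("S1", "KRAS", "p.G12D", 0.22),
    ("S2", "TP53", "p.R248Q", 0.35),
]


def build_database(path=":memory:"):
    """Load all records into one SQLite database (a file path makes it portable)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS variants ("
        "sample_id TEXT, gene TEXT, variant TEXT, allele_fraction REAL)"
    )
    conn.executemany("INSERT INTO variants VALUES (?, ?, ?, ?)", RECORDS)
    conn.commit()
    return conn


def variants_per_gene(conn):
    """Run a deterministic query; ORDER BY keeps the row order stable across runs."""
    query = "SELECT gene, COUNT(*) FROM variants GROUP BY gene ORDER BY gene"
    return conn.execute(query).fetchall()


def result_fingerprint(rows):
    """Hash the query result so two runs can be compared without sharing raw data."""
    return hashlib.sha256(repr(rows).encode("utf-8")).hexdigest()


if __name__ == "__main__":
    connection = build_database()
    result = variants_per_gene(connection)
    print(result)
    print(result_fingerprint(result))
```

In a notebook or an R Markdown report, the same database file would be opened read-only and queried in the same way, so every analysis starts from identical inputs; pinning the interpreter and libraries in a container image and tracking the code in Git covers the software side described in the abstract.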


Figures:
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e69d/8503682/fadf7346ed1f/fdata-04-725095-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e69d/8503682/e82342ed153c/fdata-04-725095-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e69d/8503682/937b2b30c8a1/fdata-04-725095-g003.jpg


