PyGMQL: scalable data extraction and analysis for heterogeneous genomic datasets.

Affiliation

Department of Electronics, Information and Bioengineering, Politecnico di Milano, Milan, Italy.

Publication information

BMC Bioinformatics. 2019 Nov 8;20(1):560. doi: 10.1186/s12859-019-3159-9.

DOI: 10.1186/s12859-019-3159-9

PMID: 31703553

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6842186/

Abstract

BACKGROUND

With the growth of available sequenced datasets, analysis of heterogeneous processed data can answer increasingly relevant biological and clinical questions. Scientists are challenged in performing efficient and reproducible data extraction and analysis pipelines over heterogeneously processed datasets. Available software packages are suitable for analyzing experimental files from such datasets one by one, but do not scale to thousands of experiments. Moreover, they lack proper support for metadata manipulation.

RESULTS

We present PyGMQL, a novel software package for the manipulation of region-based genomic files and their associated metadata, built on top of the GMQL genomic big data management system. PyGMQL provides a set of expressive functions for the manipulation of region data and their metadata that can scale to arbitrary clusters and implicitly apply to thousands of files, producing millions of regions. PyGMQL provides data interoperability, distribution transparency and query outsourcing. The PyGMQL package integrates scalable data extraction over the Apache Spark engine underlying the GMQL implementation with native Python support for interactive data analysis and visualization. It supports data interoperability, solving the impedance mismatch between executing set-oriented queries and programming in Python. PyGMQL provides distribution transparency (the ability to address a remote dataset) and query outsourcing (the ability to assign processing to a remote service) in an orthogonal way. Outsourced processing can address cloud-based installations of the GMQL engine.
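The idea of set-oriented operations that apply implicitly to every file in a dataset can be illustrated with a toy, in-memory model. The sketch below is not the PyGMQL API: the `Region` and `Sample` classes, the `select_samples` and `select_regions` helpers, and the example metadata keys are all hypothetical names chosen for illustration, and nothing here is distributed or backed by Spark. It only shows how one metadata predicate and one region predicate can act on a whole collection of samples at once, with metadata carried along with the regions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """One genomic region (hypothetical minimal schema)."""
    chrom: str
    start: int
    stop: int
    score: float


@dataclass
class Sample:
    """One experiment file: a list of regions plus its metadata."""
    metadata: dict
    regions: list


def select_samples(dataset, predicate):
    """Keep whole samples whose metadata satisfy the predicate."""
    return [s for s in dataset if predicate(s.metadata)]


def select_regions(dataset, predicate):
    """Filter regions inside every sample; metadata are copied unchanged."""
    return [
        Sample(dict(s.metadata), [r for r in s.regions if predicate(r)])
        for s in dataset
    ]


# A tiny made-up dataset of three samples (illustrative values only).
dataset = [
    Sample({"cell": "K562", "assay": "ChIP-seq"},
           [Region("chr1", 100, 200, 0.9), Region("chr2", 50, 80, 0.2)]),
    Sample({"cell": "HepG2", "assay": "ChIP-seq"},
           [Region("chr1", 150, 300, 0.7)]),
    Sample({"cell": "K562", "assay": "RNA-seq"},
           [Region("chr1", 10, 20, 0.5)]),
]

# One metadata query and one region query, each applied to all samples.
chip = select_samples(dataset, lambda m: m["assay"] == "ChIP-seq")
strong = select_regions(chip, lambda r: r.score >= 0.5)

for s in strong:
    print(s.metadata["cell"], len(s.regions))
```

In this toy model the caller never loops over files explicitly; the same two calls would be written whether the dataset held three samples or thousands, which is the property the abstract attributes to PyGMQL's functions.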

CONCLUSIONS

PyGMQL is an effective and innovative tool for supporting tertiary data extraction and analysis pipelines. We demonstrate the expressiveness and performance of PyGMQL through a sequence of biological data analysis scenarios of increasing complexity, which highlight reproducibility, expressive power and scalability.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9fa2/6842186/70b2149b6f6a/12859_2019_3159_Fig1_HTML.jpg
