PyGMQL: scalable data extraction and analysis for heterogeneous genomic datasets.

Affiliation

Department of Electronics, Information and Bioengineering, Politecnico di Milano, Milan, Italy.

Publication information

BMC Bioinformatics. 2019 Nov 8;20(1):560. doi: 10.1186/s12859-019-3159-9.

DOI:10.1186/s12859-019-3159-9
PMID:31703553
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6842186/
Abstract

BACKGROUND

With the growth of available sequenced datasets, analysis of heterogeneous processed data can answer increasingly relevant biological and clinical questions. Scientists are challenged in performing efficient and reproducible data extraction and analysis pipelines over heterogeneously processed datasets. Available software packages are suitable for analyzing experimental files from such datasets one by one, but do not scale to thousands of experiments. Moreover, they lack proper support for metadata manipulation.

RESULTS

We present PyGMQL, a novel software for the manipulation of region-based genomic files and their relative metadata, built on top of the GMQL genomic big data management system. PyGMQL provides a set of expressive functions for the manipulation of region data and their metadata that can scale to arbitrary clusters and implicitly apply to thousands of files, producing millions of regions. PyGMQL provides data interoperability, distribution transparency and query outsourcing. The PyGMQL package integrates scalable data extraction over the Apache Spark engine underlying the GMQL implementation with native Python support for interactive data analysis and visualization. It supports data interoperability, solving the impedance mismatch between executing set-oriented queries and programming in Python. PyGMQL provides distribution transparency (the ability to address a remote dataset) and query outsourcing (the ability to assign processing to a remote service) in an orthogonal way. Outsourced processing can address cloud-based installations of the GMQL engine.
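The idea of functions that act on region data and metadata together, and that apply implicitly to every file in a dataset, can be illustrated with a small toy model. The sketch below is not the PyGMQL API: the `Region` and `Sample` classes, the `select` function, and the sample values are all hypothetical names and made-up data, used only to show one set-oriented call filtering samples by metadata and regions by attribute without an explicit per-file loop in user code.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Toy stand-ins for illustration only. None of these names come from PyGMQL.


@dataclass(frozen=True)
class Region:
    chrom: str
    start: int
    stop: int
    score: float


@dataclass
class Sample:
    # One experiment file: free-form metadata plus its genomic regions.
    metadata: Dict[str, str]
    regions: List[Region] = field(default_factory=list)


def select(
    samples: List[Sample],
    meta_pred: Callable[[Dict[str, str]], bool],
    region_pred: Callable[[Region], bool],
) -> List[Sample]:
    """Keep samples whose metadata satisfy meta_pred; inside each kept
    sample, keep only the regions that satisfy region_pred."""
    return [
        Sample(s.metadata, [r for r in s.regions if region_pred(r)])
        for s in samples
        if meta_pred(s.metadata)
    ]


# Made-up dataset of three samples (values are illustrative, not real data).
dataset = [
    Sample({"assay": "ChIP-seq", "cell": "K562"},
           [Region("chr1", 100, 200, 0.9), Region("chr1", 500, 650, 0.2)]),
    Sample({"assay": "RNA-seq", "cell": "K562"},
           [Region("chr2", 10, 90, 0.8)]),
    Sample({"assay": "ChIP-seq", "cell": "HepG2"},
           [Region("chr1", 120, 180, 0.7)]),
]

# A single call expresses the query over the whole dataset at once.
chip_strong = select(
    dataset,
    meta_pred=lambda m: m["assay"] == "ChIP-seq",
    region_pred=lambda r: r.score >= 0.5,
)
```

In this toy version everything runs in memory on one machine. The abstract's point is that the same style of query can be handed to a Spark-backed engine, locally or remotely, without changing how the query is written.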

CONCLUSIONS

PyGMQL is an effective and innovative tool for supporting tertiary data extraction and analysis pipelines. We demonstrate the expressiveness and performance of PyGMQL through a sequence of biological data analysis scenarios of increasing complexity, which highlight reproducibility, expressive power and scalability.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9fa2/6842186/70b2149b6f6a/12859_2019_3159_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9fa2/6842186/5866aec2d02d/12859_2019_3159_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9fa2/6842186/770f66e3d872/12859_2019_3159_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9fa2/6842186/1484696e4857/12859_2019_3159_Fig4_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9fa2/6842186/1249b3117ad6/12859_2019_3159_Fig5_HTML.jpg

Similar articles

1. PyGMQL: scalable data extraction and analysis for heterogeneous genomic datasets.
   BMC Bioinformatics. 2019 Nov 8;20(1):560. doi: 10.1186/s12859-019-3159-9.
2. RGMQL: scalable and interoperable computing of heterogeneous omics big data and metadata in R/Bioconductor.
   BMC Bioinformatics. 2022 Apr 7;23(1):123. doi: 10.1186/s12859-022-04648-4.
3. GenoMetric Query Language: a novel approach to large-scale genomic data management.
   Bioinformatics. 2015 Jun 15;31(12):1881-8. doi: 10.1093/bioinformatics/btv048. Epub 2015 Feb 3.
4. Data Management for Heterogeneous Genomic Datasets.
   IEEE/ACM Trans Comput Biol Bioinform. 2017 Nov-Dec;14(6):1251-1264. doi: 10.1109/TCBB.2016.2576447. Epub 2016 Jun 7.
5. Explorative visual analytics on interval-based genomic data and their metadata.
   BMC Bioinformatics. 2017 Dec 4;18(1):536. doi: 10.1186/s12859-017-1945-9.
6. Processing of big heterogeneous genomic datasets for tertiary analysis of Next Generation Sequencing data.
   Bioinformatics. 2019 Mar 1;35(5):729-736. doi: 10.1093/bioinformatics/bty688.
7. Modeling and interoperability of heterogeneous genomic big data for integrative processing and querying.
   Methods. 2016 Dec 1;111:3-11. doi: 10.1016/j.ymeth.2016.09.002. Epub 2016 Sep 13.
8. Analyzing big datasets of genomic sequences: fast and scalable collection of k-mer statistics.
   BMC Bioinformatics. 2019 Apr 18;20(Suppl 4):138. doi: 10.1186/s12859-019-2694-8.
9. SeqWare Query Engine: storing and searching sequence data in the cloud.
   BMC Bioinformatics. 2010 Dec 21;11 Suppl 12(Suppl 12):S2. doi: 10.1186/1471-2105-11-S12-S2.
10. Processing genome-wide association studies within a repository of heterogeneous genomic datasets.
   BMC Genom Data. 2023 Mar 3;24(1):13. doi: 10.1186/s12863-023-01111-y.

Cited by

1. Processing genome-wide association studies within a repository of heterogeneous genomic datasets.
   BMC Genom Data. 2023 Mar 3;24(1):13. doi: 10.1186/s12863-023-01111-y.
2. Framing Apache Spark in life sciences.
   Heliyon. 2023 Feb 9;9(2):e13368. doi: 10.1016/j.heliyon.2023.e13368. eCollection 2023 Feb.
3. Genomic data integration and user-defined sample-set extraction for population variant analysis.
   BMC Bioinformatics. 2022 Sep 29;23(1):401. doi: 10.1186/s12859-022-04927-0.
4. GeMI: interactive interface for transformer-based Genomic Metadata Integration.
   Database (Oxford). 2022 Jun 3;2022. doi: 10.1093/database/baac036.
5. RGMQL: scalable and interoperable computing of heterogeneous omics big data and metadata in R/Bioconductor.
   BMC Bioinformatics. 2022 Apr 7;23(1):123. doi: 10.1186/s12859-022-04648-4.
6. Data Integration Challenges for Machine Learning in Precision Medicine.
   Front Med (Lausanne). 2022 Jan 25;8:784455. doi: 10.3389/fmed.2021.784455. eCollection 2021.
7. Spatial patterns of CTCF sites define the anatomy of TADs and their boundaries.
   Genome Biol. 2020 Aug 12;21(1):197. doi: 10.1186/s13059-020-02108-x.
8. GenoSurf: metadata driven semantic search system for integrated genomic datasets.
   Database (Oxford). 2019 Jan 1;2019. doi: 10.1093/database/baz132.