Impact of concurrency on the performance of a whole exome sequencing pipeline.

Affiliations

Department of Physics and Astronomy, University of Bologna, 40127, Bologna, BO, Italy.

Department of Experimental, Diagnostic and Specialty Medicine, University of Bologna, 40138, Bologna, BO, Italy.

Publication information

BMC Bioinformatics. 2021 Feb 9;22(1):60. doi: 10.1186/s12859-020-03780-3.

DOI: 10.1186/s12859-020-03780-3
PMID: 33563206
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7874478/
Abstract

BACKGROUND

Current high-throughput technologies (e.g. whole genome sequencing, RNA-Seq, ChIP-Seq) generate huge amounts of data, and their usage becomes more widespread with each passing year. Complex analysis pipelines involving several computationally intensive steps have to be applied to an increasing number of samples. Workflow management systems allow parallelization and a more efficient usage of computational power. Nevertheless, this mostly happens by assigning the available cores to the pipeline of a single sample, or of a few samples, at a time. We refer to this approach as the naive parallel strategy (NPS). Here, we discuss an alternative approach, which we refer to as the concurrent execution strategy (CES), which distributes the available processors equally across every sample's pipeline.
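
The difference between the two strategies can be illustrated with a minimal sketch of how each one assigns cores. The function names, the sample labels, and the figure of 12 cores are hypothetical choices made for illustration; they are not taken from the paper or from any workflow manager.

```python
from typing import List, Tuple

Batch = List[Tuple[str, int]]  # (sample name, cores assigned)


def nps_schedule(samples: List[str], total_cores: int) -> List[Batch]:
    """Naive parallel strategy: one sample per batch, using every core."""
    return [[(sample, total_cores)] for sample in samples]


def ces_schedule(samples: List[str], total_cores: int) -> List[Batch]:
    """Concurrent execution strategy: a single batch, cores split evenly."""
    share = max(1, total_cores // len(samples))  # at least one core each
    return [[(sample, share) for sample in samples]]


samples = ["S1", "S2", "S3", "S4", "S5", "S6"]  # hypothetical sample labels
print(nps_schedule(samples, 12)[0])  # first batch: S1 alone on all 12 cores
print(ces_schedule(samples, 12)[0])  # only batch: six samples, 2 cores each
```

Under the NPS the batches run one after another, whereas under the CES the single batch runs all pipelines at once. If there are more samples than cores, this sketch still gives each sample one core, which oversubscribes the machine.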

RESULTS

In theory, we show that, under loose conditions, the CES yields a substantial speedup, with an ideal gain ranging from 1 to the number of samples. We also observe that the CES yields even faster executions because parallelizable tasks scale sub-linearly. In practice, we tested both strategies on a whole exome sequencing pipeline applied to three publicly available matched tumour-normal sample pairs of gastrointestinal stromal tumour. The CES achieved latency speedups of up to 2-2.4 times compared to the NPS.
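
The stated gain range can be reproduced with a simple Amdahl-style model. This is a sketch under assumptions, not the paper's own derivation: each sample's pipeline is assumed to have a serial part and a parallel part, and the parallel part is assumed to scale as a power law in the number of cores, with a hypothetical exponent `alpha` (1.0 for linear scaling, below 1.0 for sub-linear scaling). The core count and time values are likewise illustrative.

```python
def sample_time(serial: float, parallel: float, cores: float,
                alpha: float = 1.0) -> float:
    """Run time of one sample's pipeline on `cores` cores (assumed model)."""
    return serial + parallel / cores ** alpha


def nps_latency(n_samples, n_cores, serial, parallel, alpha=1.0):
    # NPS: each sample uses all cores; samples run one after another.
    return n_samples * sample_time(serial, parallel, n_cores, alpha)


def ces_latency(n_samples, n_cores, serial, parallel, alpha=1.0):
    # CES: all samples run at once, each on an equal share of the cores.
    return sample_time(serial, parallel, n_cores / n_samples, alpha)


def gain(n_samples, n_cores, serial, parallel, alpha=1.0):
    """Latency speedup of the CES over the NPS."""
    return (nps_latency(n_samples, n_cores, serial, parallel, alpha)
            / ces_latency(n_samples, n_cores, serial, parallel, alpha))


# Six samples (three tumour-normal pairs) on a hypothetical 12-core machine.
print(gain(6, 12, serial=1.0, parallel=0.0))  # purely serial pipeline
print(gain(6, 12, serial=0.0, parallel=1.0))  # perfectly parallel pipeline
print(gain(6, 12, serial=0.0, parallel=1.0, alpha=0.8))  # sub-linear scaling
```

In this model a purely serial pipeline gives a gain equal to the number of samples, a perfectly parallel one with linear scaling gives a gain of 1, and sub-linear scaling pushes the gain above 1 even without any serial part.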

CONCLUSIONS

Our results suggest that if resource distribution is further tailored to fit specific situations, an even greater gain in the performance of multi-sample pipeline execution could be achieved. For this to be feasible, the tools included in the pipeline would need to be benchmarked. In our opinion, these benchmarks should be consistently performed by the tools' developers. Finally, these results suggest that concurrent strategies might also lead to energy and cost savings by making the use of low-power machine clusters feasible.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e0e4/7874478/75fc38a03a5f/12859_2020_3780_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e0e4/7874478/a2d718b294d7/12859_2020_3780_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e0e4/7874478/767aa1f0b19b/12859_2020_3780_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e0e4/7874478/041ac8fb5b39/12859_2020_3780_Fig4_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e0e4/7874478/3dd10f5c7d82/12859_2020_3780_Fig5_HTML.jpg

Cited by

1
Genomic, transcriptomic and RNA editing analysis of human MM1 and VV2 sporadic Creutzfeldt-Jakob disease.
Acta Neuropathol Commun. 2022 Dec 14;10(1):181. doi: 10.1186/s40478-022-01483-9.
2
Correction to: Impact of concurrency on the performance of a whole exome sequencing pipeline.
BMC Bioinformatics. 2021 Jun 1;22(1):292. doi: 10.1186/s12859-021-04205-5.
