Accelerating next generation sequencing data analysis with system level optimizations.

Affiliation

Biomedical Informatics, Research Branch, Sidra Medical and Research Center, Post Box No. 26999, Doha, Qatar.

Publication information

Sci Rep. 2017 Aug 22;7(1):9058. doi: 10.1038/s41598-017-09089-1.

DOI:10.1038/s41598-017-09089-1
PMID:28831090
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5567265/

Abstract

Next generation sequencing (NGS) data analysis is highly compute intensive. In-memory computing, vectorization, bulk data transfer, and CPU frequency scaling are among the hardware features of modern computing architectures. To get the best execution time and make use of these features, system level parameters must be tuned before running the application. We studied GATK-HaplotypeCaller, a step in common NGS workflows that consumes more than 43% of the total execution time. Multiple GATK 3.x versions were benchmarked, and the execution time of HaplotypeCaller was optimized through several system level parameters: (i) tuning parallel garbage collection and kernel shared memory to simulate in-memory computing; (ii) architecture-specific tuning of the PairHMM library for vectorization; (iii) enabling Java 1.8 features, by compiling GATK from source and building a runtime environment, for parallel sorting and bulk data transfer; and (iv) replacing the default 'on-demand' CPU frequency mode with 'performance' mode to accelerate multi-threaded Java execution. As a result, the HaplotypeCaller execution time was reduced by 82.66% in GATK 3.3 and 42.61% in GATK 3.7. Overall, the execution time of the NGS pipeline was reduced to 70.60% and 34.14% for GATK 3.3 and GATK 3.7, respectively.
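
The kinds of settings listed in (i), (ii) and (iv) can be illustrated with a small Python sketch. It is not the authors' configuration: the abstract gives no flag values, so the thread counts, heap size, and file names below are placeholders, and the helper names (`haplotypecaller_command`, `governor_path`, `percent_reduction`) are hypothetical. The JVM options `-XX:+UseParallelGC` and `-XX:ParallelGCThreads` are standard HotSpot flags; `-nct` and `-pairHMM VECTOR_LOGLESS_CACHING` are believed to be valid GATK 3.x arguments but should be checked against the installed version. The sketch only builds the command line and the sysfs path; it does not run GATK or change the CPU governor, and it does not cover the kernel shared memory tuning.

```python
def haplotypecaller_command(jar, reference, bam, out_vcf,
                            gc_threads=4, heap_gb=8, data_threads=4):
    """Build (but do not run) a GATK 3.x HaplotypeCaller command line.

    All numeric defaults are placeholders, not values from the paper.
    """
    jvm_flags = [
        f"-Xmx{heap_gb}g",                       # maximum Java heap size
        "-XX:+UseParallelGC",                    # parallel garbage collector
        f"-XX:ParallelGCThreads={gc_threads}",   # GC worker threads
    ]
    gatk_args = [
        "-T", "HaplotypeCaller",
        "-R", reference,
        "-I", bam,
        "-o", out_vcf,
        "-nct", str(data_threads),               # CPU threads per data thread (assumed GATK 3.x flag)
        "-pairHMM", "VECTOR_LOGLESS_CACHING",    # vectorized PairHMM (assumed GATK 3.x flag)
    ]
    return ["java", *jvm_flags, "-jar", jar, *gatk_args]


def governor_path(cpu=0):
    """Linux sysfs file holding the cpufreq governor ('ondemand', 'performance', ...) of one core."""
    return f"/sys/devices/system/cpu/cpu{cpu}/cpufreq/scaling_governor"


def percent_reduction(before, after):
    """Relative reduction in execution time, in percent."""
    return 100.0 * (before - after) / before


if __name__ == "__main__":
    cmd = haplotypecaller_command("GenomeAnalysisTK.jar", "ref.fa",
                                  "sample.bam", "sample.vcf",
                                  gc_threads=8, heap_gb=16, data_threads=8)
    print(" ".join(cmd))
    print(governor_path(0))
```

The returned list can be passed to `subprocess.run` once the paths point at real files. `percent_reduction` states the arithmetic behind a figure such as the reported 82.66%: a run that drops from 100 time units to 17.34 has been reduced by 82.66%.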

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4494/5567265/3c58f968b2fa/41598_2017_9089_Fig1_HTML.jpg
Figure 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4494/5567265/3df5b6719781/41598_2017_9089_Fig2_HTML.jpg
