OVarFlow: a resource optimized GATK 4 based Open source Variant calling workFlow.

Affiliation

Institute of Animal Breeding and Genetics, Justus Liebig University Gießen, Ludwigstraße 21, 35390, Gießen, Germany.

Publication information

BMC Bioinformatics. 2021 Aug 13;22(1):402. doi: 10.1186/s12859-021-04317-y.

DOI:10.1186/s12859-021-04317-y

PMID:34388963

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8361789/

Abstract

BACKGROUND

The advent of next generation sequencing has opened new avenues for basic and applied research. One application is the discovery of sequence variants causative of a phenotypic trait or a disease pathology. The computational task of detecting and annotating sequence differences between a target dataset and a reference genome is known as "variant calling". Typically, this task is computationally involved and combines a complex chain of linked software tools. A major player in this field is the Genome Analysis Toolkit (GATK). The "GATK Best Practices" are a commonly cited recipe for variant calling. However, current computational recommendations on variant calling predominantly focus on human sequencing data and ignore the ever-changing demands of high-throughput sequencing developments. Furthermore, frequent updates to such recommendations run counter to the goal of offering a standard workflow and hamper reproducibility over time.

RESULTS

A workflow for automated detection of single nucleotide polymorphisms and insertion-deletions offers a wide range of applications in sequence annotation of model and non-model organisms. The introduced workflow builds on the GATK Best Practices, while enabling reproducibility over time and offering an open, generalized computational architecture. The workflow achieves parallelized data evaluation and maximizes performance of individual computational tasks. Optimized Java garbage collection and heap size settings for the GATK applications SortSam, MarkDuplicates, HaplotypeCaller, and GatherVcfs effectively cut the overall analysis time in half.
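As an illustration of what per-tool heap and garbage collection settings can look like, the following minimal sketch builds a command line for the `gatk` wrapper script and passes JVM settings through its `--java-options` argument. The helper name `gatk_command`, the file names, and the numeric values (2 GB heap, 2 GC threads) are assumptions chosen for illustration; they are not the settings reported by the authors, and suitable values depend on the tool and the data.

```python
import shlex
from typing import List


def gatk_command(tool: str, tool_args: List[str], heap_gb: int, gc_threads: int) -> List[str]:
    """Build an argv list for the `gatk` wrapper with explicit JVM settings.

    -Xmx caps the Java heap size; -XX:ParallelGCThreads limits the number
    of threads the JVM uses for parallel garbage collection.
    """
    java_options = f"-Xmx{heap_gb}G -XX:ParallelGCThreads={gc_threads}"
    return ["gatk", "--java-options", java_options, tool, *tool_args]


# Hypothetical inputs and resource values, for illustration only.
cmd = gatk_command(
    "HaplotypeCaller",
    ["-R", "reference.fasta", "-I", "sample.bam",
     "-O", "sample.g.vcf.gz", "-ERC", "GVCF"],
    heap_gb=2,
    gc_threads=2,
)

# Print a shell-safe version of the command instead of running it.
print(" ".join(shlex.quote(part) for part in cmd))
```

Limiting both values per process matters most when many GATK processes run in parallel on one machine, because unconstrained JVMs may each size their heap and GC thread pool from the resources of the whole machine.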

CONCLUSIONS

The demand for variant calling, efficient computational processing, and standardized workflows is growing. The Open source Variant calling workFlow (OVarFlow) offers automation and reproducibility for a computationally optimized variant calling task. By reducing the usage of computational resources, the workflow removes pre-existing entry barriers to the variant calling field and enables standardized variant calling.


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b519/8361789/059c7b3a6deb/12859_2021_4317_Fig1_HTML.jpg

Similar articles

OVarFlow: a resource optimized GATK 4 based Open source Variant calling workFlow.

BMC Bioinformatics. 2021 Aug 13;22(1):402. doi: 10.1186/s12859-021-04317-y.

An analytical workflow for accurate variant discovery in highly divergent regions.

BMC Genomics. 2016 Sep 2;17(1):703. doi: 10.1186/s12864-016-3045-z.

A high-performance computational workflow to accelerate GATK SNP detection across a 25-genome dataset.

BMC Biol. 2024 Jan 25;22(1):13. doi: 10.1186/s12915-024-01820-5.

An optimized genomic VCF workflow for precise identification of Mycobacterium tuberculosis cluster from cross-platform whole genome sequencing data.

Infect Genet Evol. 2020 Apr;79:104152. doi: 10.1016/j.meegid.2019.104152. Epub 2019 Dec 24.

Fast and accurate DNASeq variant calling workflow composed of LUSH toolkit.

Hum Genomics. 2024 Oct 10;18(1):114. doi: 10.1186/s40246-024-00666-w.

Impact of post-alignment processing in variant discovery from whole exome data.

BMC Bioinformatics. 2016 Oct 3;17(1):403. doi: 10.1186/s12859-016-1279-z.

Variant Calling from RNA-seq Data Using the GATK Joint Genotyping Workflow.

Methods Mol Biol. 2022;2493:205-233. doi: 10.1007/978-1-0716-2293-3_13.

VC@Scale: Scalable and high-performance variant calling on cluster environments.

Gigascience. 2021 Sep 7;10(9). doi: 10.1093/gigascience/giab057.

SNP-SVant: A Computational Workflow to Predict and Annotate Genomic Variants in Organisms Lacking Benchmarked Variants.

Curr Protoc. 2024 May;4(5):e1046. doi: 10.1002/cpz1.1046.

ADS-HCSpark: A scalable HaplotypeCaller leveraging adaptive data segmentation to accelerate variant calling on Spark.

BMC Bioinformatics. 2019 Feb 14;20(1):76. doi: 10.1186/s12859-019-2665-0.

Cited by

Transcriptome dynamics and allele-specific regulation underlie wheat heterosis at the anthesis and grain-filling stages.

BMC Genomics. 2025 Sep 2;26(1):798. doi: 10.1186/s12864-025-11983-2.

Learning-based parallel acceleration for HaplotypeCaller.

BMC Bioinformatics. 2025 Aug 20;26(1):217. doi: 10.1186/s12859-025-06242-w.

Fine Mapping Identifies Candidate Genes Associated with Swine Inflammation and Necrosis Syndrome.

Vet Sci. 2025 May 21;12(5):508. doi: 10.3390/vetsci12050508.

Whole-Genome Resequencing Analysis of Copy Number Variations Associated with Athletic Performance in Grassland-Thoroughbred.

Animals (Basel). 2025 May 18;15(10):1458. doi: 10.3390/ani15101458.

Genome-Wide Association Study Reveals Single Nucleotide Polymorphisms Associated with Tail Length and Tail Kinks in Piglets.

Vet Sci. 2025 Feb 24;12(3):198. doi: 10.3390/vetsci12030198.

Comprehensive Molecular and Genomic Analysis of NCI-MATCH Subprotocol Y: Capivasertib in Patients With an -Mutated Tumor.

JCO Precis Oncol. 2025 Mar;9:e2400614. doi: 10.1200/PO-24-00614. Epub 2025 Mar 28.

Genome-Wide Association Studies for Lactation Performance in Buffaloes.

Genes (Basel). 2025 Jan 27;16(2):163. doi: 10.3390/genes16020163.

Emerging epidemic of the Africa-type plasmid in penicillinase-producing in Guangdong, China, 2013-2022.

Emerg Microbes Infect. 2025 Dec;14(1):2440489. doi: 10.1080/22221751.2024.2440489. Epub 2024 Dec 26.

RNA-Seq based selection signature analysis for identifying genomic footprints associated with the fat-tail phenotype in sheep.

Front Vet Sci. 2024 Sep 30;11:1415027. doi: 10.3389/fvets.2024.1415027. eCollection 2024.

Sterol 14-alpha demethylase (CYP51) activity in Leishmania donovani is likely dependent upon cytochrome P450 reductase 1.

PLoS Pathog. 2024 Jul 11;20(7):e1012382. doi: 10.1371/journal.ppat.1012382. eCollection 2024 Jul.

References

Twelve years of SAMtools and BCFtools.

Gigascience. 2021 Feb 16;10(2). doi: 10.1093/gigascience/giab008.

DeepVariant-on-Spark: Small-Scale Genome Analysis Using a Cloud-Based Computing Framework.

Comput Math Methods Med. 2020 Sep 1;2020:7231205. doi: 10.1155/2020/7231205. eCollection 2020.

Benchmarking variant callers in next-generation and third-generation sequencing analysis.

Brief Bioinform. 2021 May 20;22(3). doi: 10.1093/bib/bbaa148.

Systematic dissection of biases in whole-exome and whole-genome sequencing reveals major determinants of coding sequence coverage.

Sci Rep. 2020 Feb 6;10(1):2057. doi: 10.1038/s41598-020-59026-y.

Recommendations for performance optimizations when using GATK3.8 and GATK4.

BMC Bioinformatics. 2019 Nov 8;20(1):557. doi: 10.1186/s12859-019-3169-7.

Challenges and recommendations to improve the installability and archival stability of omics computational tools.

PLoS Biol. 2019 Jun 20;17(6):e3000333. doi: 10.1371/journal.pbio.3000333. eCollection 2019 Jun.

Deep Genome Resequencing Reveals Artificial and Natural Selection for Visual Deterioration, Plateau Adaptability and High Prolificacy in Chinese Domestic Sheep.

Front Genet. 2019 Apr 2;10:300. doi: 10.3389/fgene.2019.00300. eCollection 2019.

Comparative analysis of the chicken IFITM locus by targeted genome sequencing reveals evolution of the locus and positive selection in IFITM1 and IFITM3.

BMC Genomics. 2019 Apr 5;20(1):272. doi: 10.1186/s12864-019-5621-5.

Comparison of three variant callers for human whole genome sequencing.

Sci Rep. 2018 Dec 14;8(1):17851. doi: 10.1038/s41598-018-36177-7.

No more excuses for non-reproducible methods.

Nature. 2018 Aug;560(7719):411. doi: 10.1038/d41586-018-06008-w.
