Institute of Animal Breeding and Genetics, Justus Liebig University Gießen, Ludwigstraße 21, 35390, Gießen, Germany.
BMC Bioinformatics. 2021 Aug 13;22(1):402. doi: 10.1186/s12859-021-04317-y.
The advent of next-generation sequencing has opened new avenues for basic and applied research. One application is the discovery of sequence variants causative of a phenotypic trait or a disease pathology. The computational task of detecting and annotating sequence differences between a target dataset and a reference genome is known as "variant calling". Typically, this task is computationally involved, often combining a complex chain of linked software tools. A major player in this field is the Genome Analysis Toolkit (GATK). The "GATK Best Practices" is a commonly referenced recipe for variant calling. However, current computational recommendations on variant calling predominantly focus on human sequencing data and ignore the ever-changing demands of high-throughput sequencing developments. Furthermore, frequent updates to such recommendations run counter to the goal of offering a standard workflow and hamper reproducibility over time.
A workflow for automated detection of single nucleotide polymorphisms and insertion-deletions offers a wide range of applications in sequence annotation of model and non-model organisms. The introduced workflow builds on the GATK Best Practices, while enabling reproducibility over time and offering an open, generalized computational architecture. The workflow achieves parallelized data evaluation and maximizes performance of individual computational tasks. Optimized Java garbage collection and heap size settings for the GATK applications SortSam, MarkDuplicates, HaplotypeCaller, and GatherVcfs effectively cut the overall analysis time in half.
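To illustrate what per-application heap size and garbage collection settings can look like in practice, the following is a minimal sketch, not the OVarFlow implementation. It assumes the GATK 4 command-line wrapper, which passes JVM flags through `--java-options`. The `JVM_SETTINGS` table, the `gatk_command` helper, the file names, and all numeric values are hypothetical placeholders, not the tuned values reported for the workflow.

```python
import shlex

# Hypothetical per-tool JVM settings (heap size in GB, parallel GC threads).
# These numbers are illustrative placeholders, NOT the values used by OVarFlow.
JVM_SETTINGS = {
    "SortSam": {"heap_gb": 10, "gc_threads": 2},
    "MarkDuplicates": {"heap_gb": 2, "gc_threads": 2},
    "HaplotypeCaller": {"heap_gb": 2, "gc_threads": 2},
    "GatherVcfs": {"heap_gb": 2, "gc_threads": 2},
}


def gatk_command(tool, tool_args, settings=JVM_SETTINGS):
    """Build a GATK 4 command line with explicit heap and GC thread limits."""
    opts = settings[tool]
    # -Xmx caps the Java heap; -XX:ParallelGCThreads caps the GC worker threads.
    java_options = (
        f"-Xmx{opts['heap_gb']}g -XX:ParallelGCThreads={opts['gc_threads']}"
    )
    return ["gatk", "--java-options", java_options, tool, *tool_args]


# Example: a HaplotypeCaller invocation with hypothetical input and output files.
cmd = gatk_command(
    "HaplotypeCaller",
    ["-R", "reference.fasta", "-I", "sample.bam",
     "-O", "sample.g.vcf.gz", "-ERC", "GVCF"],
)
print(shlex.join(cmd))
```

Limiting the garbage collection threads per process matters mainly when many GATK processes run in parallel on one machine, since each JVM otherwise sizes its thread pool from the total number of cores. The list returned by `gatk_command` can be handed directly to `subprocess.run`.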
The demand for variant calling, efficient computational processing, and standardized workflows is growing. The Open source Variant calling workFlow (OVarFlow) offers automation and reproducibility for a computationally optimized variant calling task. By reducing the use of computational resources, the workflow removes previously existing entry barriers to the variant calling field and enables standardized variant calling.