USDA, Agricultural Research Service, Jamie Whitten Delta States Research Center, Genomics and Bioinformatics Research Unit, Stoneville, Mississippi.
Oak Ridge Institute for Science and Education, Oak Ridge, Tennessee.
Genome Biol Evol. 2023 Mar 3;15(3). doi: 10.1093/gbe/evad020.
Long-read sequencing has revolutionized genome assembly, yielding highly contiguous, chromosome-level contigs. However, assemblies from some third-generation long-read technologies, such as Pacific Biosciences (PacBio) continuous long reads (CLR), have a high error rate. Such errors can be corrected with short reads through a process called polishing. Although best practices for polishing non-model de novo genome assemblies were recently described by the Vertebrate Genome Project (VGP) Assembly community, there is a need for a publicly available, reproducible workflow that can be easily implemented and run on a conventional high-performance computing environment. Here, we describe polishCLR (https://github.com/isugifNF/polishCLR), a reproducible Nextflow workflow that implements best practices for polishing assemblies made from CLR data. PolishCLR can be initiated from several input options that extend best practices to suboptimal cases. It also provides re-entry points throughout several key processes, including identifying duplicate haplotypes in purge_dups, allowing a break for scaffolding if data are available, and throughout multiple rounds of polishing and evaluation with Arrow and FreeBayes. PolishCLR is containerized and publicly available for the greater assembly community as a tool to complete assemblies from existing, error-prone long-read data.