Department of Computer Science, University of Maryland, College Park, MD, 20742, USA.
Center for Food Safety and Applied Nutrition, Food and Drug Administration, Laurel, MD, 20708, USA.
BMC Genomics. 2024 Jul 8;25(1):679. doi: 10.1186/s12864-024-10582-x.
Oxford Nanopore provides high-throughput sequencing platforms able to reconstruct complete bacterial genomes with 99.95% accuracy. However, even small levels of error can obscure the phylogenetic relationships between closely related isolates. Polishing tools have been developed to correct these errors, but it is uncertain whether they achieve the accuracy needed for high-resolution source tracking of foodborne illness outbreaks.
We tested 132 combinations of assembly and short- and long-read polishing tools to assess their accuracy for reconstructing the genome sequences of 15 highly similar Salmonella enterica serovar Newport isolates from a 2020 onion outbreak. While long-read polishing alone improved accuracy, near-perfect accuracy (99.9999% accuracy, or ~5 nucleotide errors across the 4.8 Mbp genome, excluding low-confidence regions) was only obtained by pipelines that combined both long- and short-read polishing tools. Notably, medaka was a more accurate and efficient long-read polisher than Racon. Among short-read polishers, NextPolish showed the highest accuracy, but Pilon, Polypolish, and POLCA performed similarly. Among the five best-performing pipelines, polishing with medaka followed by NextPolish was the most common combination. Importantly, the order of polishing tools mattered: using less accurate tools after more accurate ones introduced errors. Indels in homopolymers and repetitive regions, where the short reads could not be uniquely mapped, remained the most challenging errors to correct.
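To make the accuracy figures concrete, the following minimal sketch converts a per-base accuracy into an expected error count for a 4.8 Mbp genome. The function name `expected_errors` and the constant `GENOME_LENGTH_BP` are illustrative assumptions, not part of the study's tooling, and the calculation ignores the exclusion of low-confidence regions.

```python
def expected_errors(accuracy_percent: float, genome_length_bp: int) -> float:
    """Expected number of erroneous bases at a given per-base accuracy.

    Hypothetical helper for illustration only.
    """
    return (1.0 - accuracy_percent / 100.0) * genome_length_bp


# Approximate genome size quoted in the abstract (assumed to be exactly 4.8 Mbp here).
GENOME_LENGTH_BP = 4_800_000

# Accuracy quoted for nanopore assemblies before polishing.
unpolished = expected_errors(99.95, GENOME_LENGTH_BP)

# Accuracy reached by the combined long- and short-read polishing pipelines.
polished = expected_errors(99.9999, GENOME_LENGTH_BP)

print(f"99.95% accuracy:   about {unpolished:.0f} errors")
print(f"99.9999% accuracy: about {polished:.1f} errors")
```

Under these assumptions, 99.95% accuracy corresponds to roughly 2,400 erroneous bases, while 99.9999% corresponds to about 4.8, which matches the ~5 nucleotide errors quoted above.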
Short reads are still needed to correct errors in nanopore-sequenced assemblies and reach the accuracy required for source-tracking investigations. Our granular assessment of polishing pipeline performance allowed us to suggest best practices for tool users and areas of improvement for tool developers.