Comparison of long- and short-read metagenomic assembly for low-abundance species and resistance genes.
Affiliations
Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA, USA.
Applied Invention, LLC, Cambridge, MA, USA.
Publication information
Brief Bioinform. 2023 Mar 19;24(2). doi: 10.1093/bib/bbad050.
Recent technological and computational advances have made metagenomic assembly a viable approach to achieving high-resolution views of complex microbial communities. In previous benchmarking, short-read (SR) metagenomic assemblers had the highest accuracy, long-read (LR) assemblers generated the most contiguous sequences and hybrid (HY) assemblers balanced length and accuracy. However, no assessments have specifically compared the performance of these assemblers on low-abundance species, which include clinically relevant organisms in the gut. We generated semi-synthetic LR and SR datasets by spiking small and increasing amounts of Escherichia coli isolate reads into fecal metagenomes and, using different assemblers, examined E. coli contigs and the presence of antibiotic resistance genes (ARGs). For ARG assembly, although SR assemblers recovered more ARGs with high accuracy, even at low coverages, LR assemblies allowed for the placement of ARGs within longer, E. coli-specific contigs, thus pinpointing their taxonomic origin. HY assemblies identified resistance genes with high accuracy and had lower contiguity than LR assemblies. Each assembler type's strengths were maintained even when our isolate was spiked in with a competing strain, which fragmented and reduced the accuracy of all assemblies. For strain characterization and determining gene context, LR assembly is optimal, while for base-accurate gene identification, SR assemblers outperform other options. HY assembly offers contiguity and base accuracy, but requires generating data on multiple platforms, and may suffer high misassembly rates when strain diversity exists. Our results highlight the trade-offs associated with each approach for recovering low-abundance taxa, and that the optimal approach is goal-dependent.
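The spike-in design described above rests on simple coverage arithmetic: expected coverage is roughly (number of reads × mean read length) / genome length. As a rough illustration only, the sketch below computes how many isolate reads would be needed to reach a chosen coverage. The function name, the 5 Mb genome size, the read lengths and the coverage values are all assumptions made for the example; the abstract does not give the coverage levels or tooling the authors used.

```python
import math


def reads_for_coverage(target_coverage, genome_length_bp, mean_read_length_bp):
    """Return the smallest read count whose expected coverage meets the target.

    Hypothetical helper for illustration. It uses the approximation
    coverage = reads * mean_read_length / genome_length.
    """
    if target_coverage < 0:
        raise ValueError("target_coverage must be non-negative")
    if genome_length_bp <= 0 or mean_read_length_bp <= 0:
        raise ValueError("lengths must be positive")
    return math.ceil(target_coverage * genome_length_bp / mean_read_length_bp)


# Assumed values: an E. coli genome of about 5 Mb, 150 bp short reads,
# 10 kb long reads. None of these numbers come from the abstract.
GENOME_BP = 5_000_000

short_reads_needed = reads_for_coverage(1, GENOME_BP, 150)
long_reads_needed = reads_for_coverage(10, GENOME_BP, 10_000)
```

Because long reads are far longer, far fewer of them are needed for the same nominal coverage, which is one reason spike-in amounts are better compared in terms of coverage than in terms of read counts.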