Pacific Biosciences, 1305 O'Brien Dr, Menlo Park, CA, 94025, USA.
Department of Population Health and Reproduction, University of California Davis, Davis, CA, USA.
BMC Bioinformatics. 2022 Dec 13;23(1):541. doi: 10.1186/s12859-022-05103-0.
Long-read shotgun metagenomic sequencing is gaining popularity and offers many advantages over short-read sequencing. The higher information content in long reads is useful for a variety of metagenomics analyses, including taxonomic classification and profiling. The development of long-read specific tools for taxonomic classification is accelerating, yet there is a lack of information regarding their relative performance. Here, we perform a critical benchmarking study using 11 methods, including five methods designed specifically for long reads. We applied these tools to several mock community datasets generated using Pacific Biosciences (PacBio) HiFi or Oxford Nanopore Technologies sequencing, and evaluated their performance based on read utilization, detection metrics, and relative abundance estimates.
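As an illustration of the detection metrics mentioned above, the following is a minimal sketch of how precision, recall, and F1 could be computed for a mock community whose true species composition is known. It is not the benchmarking code used in the study; the function name `detection_metrics`, the `min_abundance` parameter, and the toy species labels are assumptions made for this example.

```python
def detection_metrics(predicted, truth, min_abundance=0.0):
    """Compute precision, recall, and F1 for species detection.

    predicted: dict mapping species name -> estimated relative abundance (fraction).
    truth: set of species actually present in the mock community.
    min_abundance: hypothetical filtering threshold; species called below
        this abundance are discarded before scoring.
    """
    called = {sp for sp, ab in predicted.items() if ab >= min_abundance}
    tp = len(called & truth)   # true species that were detected
    fp = len(called - truth)   # detected species not in the community
    fn = len(truth - called)   # true species that were missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy mock community (illustrative values only, not data from the study).
truth = {"A", "B", "C"}
predicted = {"A": 0.60, "B": 0.399, "C": 0.0005, "X": 0.0005}

unfiltered = detection_metrics(predicted, truth)
filtered = detection_metrics(predicted, truth, min_abundance=0.001)
print(unfiltered)
print(filtered)
```

In this toy case, the threshold removes the false positive `X` but also drops the true low-abundance species `C`, which is the kind of precision-for-recall trade-off that abundance filtering introduces.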
Our results show that long-read classifiers generally performed best. Several short-read classification and profiling methods produced many false positives (particularly at lower abundances), required heavy filtering to achieve acceptable precision (at the cost of reduced recall), and produced inaccurate abundance estimates. By contrast, two long-read methods (BugSeq, MEGAN-LR & DIAMOND) and one generalized method (sourmash) displayed high precision and recall without any filtering. Furthermore, in the PacBio HiFi datasets these methods detected all species down to the 0.1% abundance level with high precision. Some long-read methods, such as MetaMaps and MMseqs2, required moderate filtering to reduce false positives and approach the precision and recall of the top-performing methods. We found that read quality affected performance for methods relying on protein prediction or exact k-mer matching, and these methods performed better with PacBio HiFi datasets. We also found that long-read datasets with a large proportion of shorter reads (< 2 kb) resulted in lower precision and worse abundance estimates than length-filtered datasets. Finally, for classification methods, we found that the long-read datasets produced significantly better results than short-read datasets, demonstrating clear advantages for long-read metagenomic sequencing.
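The length-filtering observation above can be illustrated with a minimal sketch that drops reads shorter than 2 kb before classification. This is an assumption-laden example, not the filtering procedure used in the study: it assumes a plain four-line FASTQ layout, takes 2 kb to mean 2000 bases, and uses the hypothetical helper names `read_fastq` and `filter_by_length`.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality) tuples from a four-line FASTQ stream.

    Assumes every record is exactly four lines with no blank lines between
    records (an assumption of this sketch, not a general FASTQ parser).
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header[1:], seq, qual


def filter_by_length(records, min_len=2000):
    """Keep only reads whose sequence has at least min_len bases."""
    return [rec for rec in records if len(rec[1]) >= min_len]


# Synthetic demo input: one 1500-base read and one 2500-base read.
demo = (
    "@r1\n" + "A" * 1500 + "\n+\n" + "I" * 1500 + "\n"
    "@r2\n" + "C" * 2500 + "\n+\n" + "I" * 2500 + "\n"
)
kept = filter_by_length(read_fastq(io.StringIO(demo)))
kept_names = [name for name, _, _ in kept]
print(kept_names)
```

For real datasets, an established read-processing tool would normally be preferable to a hand-written parser; the sketch only makes the filtering criterion concrete.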
Our critical assessment of available methods provides best-practice recommendations for current research using long reads and establishes a baseline for future benchmarking studies.