Deduplication Improves Cost-Efficiency and Yields of Assembly and Binning of Shotgun Metagenomes in Microbiome Research.

Author information

Zhang Zhiguo, Zhang Lu, Zhang Guoqing, Zhao Ze, Wang Hui, Ju Feng

Affiliations

College of Environmental and Resources Sciences, Zhejiang University, Hangzhou, Zhejiang Province, China.

Research Center for Industries of the Future, Key Laboratory of Coastal Environment and Resources of Zhejiang Province, School of Engineering, Westlake University, Hangzhou, Zhejiang Province, China.

Publication information

Microbiol Spectr. 2023 Feb 6;11(2):e0428222. doi: 10.1128/spectrum.04282-22.

Abstract

In the last decade, metagenomics has revolutionized the study of microbial communities. However, artificial duplicate reads, which arise mainly from the preparation of metagenomic DNA sequencing libraries, and their impacts on metagenomic assembly and binning have not previously been brought to attention. Here, we explicitly investigated the effects of duplicate reads on metagenomic assembly and binning by analyzing five groups of representative metagenomes with distinct microbiome complexities. Our results showed that deduplication considerably increased the binning yields (by 3.5% to 80%) for most of the metagenomic data sets examined, thanks to improved contig length and coverage profiling of metagenome-assembled contigs, whereas it slightly decreased the binning yields of metagenomes with low complexity (e.g., human gut metagenomes). Specifically, with versus without deduplication, 411 versus 397, 331 versus 317, 104 versus 88, and 9 versus 5 metagenome-assembled genomes (MAGs) were recovered from MEGAHIT assemblies of bioreactor sludge, surface water, lake sediment, and forest soil metagenomes, respectively. Notably, deduplication significantly reduced the computational costs of metagenomic assembly, including the elapsed time (by 9.0% to 29.9%) and the maximum memory requirement (by 4.3% to 37.1%). Collectively, we recommend removing duplicate reads from metagenomes with high complexity, such as the forest soil metagenomes examined in this study, before assembly and binning analyses.

Duplicated reads in shotgun metagenomes are usually considered technical artifacts. In theory, their presence not only introduces bias into quantitative analysis but also distorts the coverage profile, which can impair or even cause failures in metagenomic assembly and binning, because the widely used metagenome assemblers and binners all need coverage information for graph partitioning and for binning assembled contigs, respectively. However, this issue has seldom been noticed, and its impacts on essential downstream bioinformatic procedures (e.g., assembly and binning) have remained unclear. In this study, we comprehensively evaluated, for the first time, the implications of duplicate reads for the assembly and binning of real metagenomic data sets by comparing assembly quality, binning yields, and computational resource requirements with and without the removal of duplicate reads. Deduplication considerably increased the binning yields of metagenomes with high complexity and significantly reduced the computational costs, including the elapsed time and the maximum memory requirement, for most of the metagenomes studied. These results provide empirical references for more cost-efficient metagenomic analyses in microbiome research.
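The abstract does not say which tool or criterion was used to remove duplicates, so the following is only a minimal sketch of one common approach: a read pair is treated as a duplicate when both its forward and reverse sequences exactly match those of an earlier pair. The helper names (`parse_fastq`, `deduplicate_pairs`) and the toy reads are hypothetical and are not taken from the study.

```python
from typing import List, Tuple

Read = Tuple[str, str, str]  # (header, sequence, quality)


def parse_fastq(text: str) -> List[Read]:
    """Parse four-line FASTQ records from a string (hypothetical helper)."""
    lines = [line for line in text.splitlines() if line.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per FASTQ record")
    # Line 0 is the header, 1 the sequence, 2 the '+' separator, 3 the quality.
    return [(lines[i], lines[i + 1], lines[i + 3]) for i in range(0, len(lines), 4)]


def deduplicate_pairs(r1: List[Read], r2: List[Read]) -> List[Tuple[Read, Read]]:
    """Keep the first occurrence of each exact (forward, reverse) sequence pair.

    Assumption: exact sequence identity defines a duplicate. Real tools may
    also tolerate mismatches or compare only a prefix of each read.
    """
    if len(r1) != len(r2):
        raise ValueError("forward and reverse files must hold the same number of reads")
    seen = set()
    kept = []
    for fwd, rev in zip(r1, r2):
        key = (fwd[1], rev[1])  # compare sequences only, not headers or qualities
        if key in seen:
            continue
        seen.add(key)
        kept.append((fwd, rev))
    return kept


# Toy paired-end data: pair b is an exact duplicate of pair a.
# Pair d shares its forward read with a but has a different reverse read.
R1_TEXT = """@a/1
ACGTACGT
+
IIIIIIII
@b/1
ACGTACGT
+
IIIIIIII
@c/1
TTGGCCAA
+
IIIIIIII
@d/1
ACGTACGT
+
IIIIIIII
"""

R2_TEXT = """@a/2
GGCCTTAA
+
IIIIIIII
@b/2
GGCCTTAA
+
IIIIIIII
@c/2
ACGTACGT
+
IIIIIIII
@d/2
TTTTAAAA
+
IIIIIIII
"""

r1 = parse_fastq(R1_TEXT)
r2 = parse_fastq(R2_TEXT)
kept = deduplicate_pairs(r1, r2)
duplicate_fraction = 1 - len(kept) / len(r1)
print(f"kept {len(kept)} of {len(r1)} pairs; duplicate fraction = {duplicate_fraction:.2f}")
```

Keying on the pair rather than on each read separately means that two fragments sharing only one mate are both retained, which is why pair d survives here. Holding every sequence in a set is fine for a toy example but would need a more memory-conscious strategy for full-size metagenomes.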

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ed42/10101064/6d7ad4ef4aa8/spectrum.04282-22-f001.jpg
