Energy Mining and Environment Research Centre, National Research Council Canada, Montreal, QC, Canada H4P-2R2.
Brief Bioinform. 2022 Nov 19;23(6). doi: 10.1093/bib/bbac443.
In shotgun metagenomics (SM), state-of-the-art bioinformatic workflows are referred to as high-resolution shotgun metagenomics (HRSM) and require intensive computing and disk storage resources. While the increased data output of the latest high-throughput DNA sequencing systems allows for unprecedented sequencing depth at minimal cost, HRSM workflows will need to be adjusted to properly process these ever-growing sequence datasets. One potential adaptation is to generate so-called shallow SM datasets, which contain less sequencing data per sample than classic high-coverage sequencing. While shallow sequencing is a promising avenue for SM data analysis, detailed benchmarks using real data are lacking. In this case study, we took four public SM datasets, one massive and the others moderate in size, and subsampled each at various levels to mimic shallow sequencing datasets of various sequencing depths. Our results suggest that shallow SM sequencing is a viable way to obtain sound results regarding microbial community structure and that high-depth sequencing does not add elements for ecological interpretation. More specifically, for the human gut and agricultural soil datasets, results obtained by subsampling as few as 0.5 M sequencing clusters per sample were similar to those obtained with the largest subsampled dataset. For an Antarctic dataset, which contained only a few samples, 4 M sequencing clusters per sample were found to generate results comparable to those from the full dataset. One area where ultra-deep sequencing and maximizing the use of all data was undeniably beneficial was the generation of metagenome-assembled genomes.
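The subsampling step described above can be illustrated with a minimal sketch. The abstract does not say which tool the authors used, so this is not their method: it assumes paired-end FASTQ input, treats one sequencing cluster as one R1/R2 read pair, and uses reservoir sampling so that a fixed number of clusters can be drawn in a single pass. The names `read_fastq` and `subsample_clusters` are hypothetical helpers introduced only for this illustration.

```python
import random
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, ...]  # the four lines of one FASTQ record


def read_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Yield FASTQ records as 4-tuples of lines with newlines stripped."""
    it = iter(lines)
    while True:
        chunk = tuple(line.rstrip("\n") for line in islice(it, 4))
        if not chunk:
            return
        if len(chunk) != 4:
            raise ValueError("truncated FASTQ record")
        yield chunk


def subsample_clusters(
    r1: Iterable[Record],
    r2: Iterable[Record],
    n: int,
    seed: int = 0,
) -> List[Tuple[Record, Record]]:
    """Draw up to n clusters (R1/R2 pairs) uniformly at random.

    Reservoir sampling (Algorithm R): the first n pairs fill the
    reservoir, then pair i replaces a random slot with probability
    n / (i + 1). Mates are sampled together so pairing is preserved.
    Note: zip() stops at the shorter input, so mismatched files are
    silently truncated in this sketch.
    """
    rng = random.Random(seed)
    reservoir: List[Tuple[Record, Record]] = []
    for i, pair in enumerate(zip(r1, r2)):
        if i < n:
            reservoir.append(pair)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < n:
                reservoir[j] = pair
    return reservoir


if __name__ == "__main__":
    # Tiny synthetic input: 10 clusters, subsampled to two depths.
    def fake(tag: int) -> List[str]:
        out: List[str] = []
        for k in range(10):
            out += [f"@c{k}/{tag}\n", "ACGT\n", "+\n", "IIII\n"]
        return out

    for depth in (2, 4):
        picked = subsample_clusters(read_fastq(fake(1)), read_fastq(fake(2)), depth, seed=42)
        print(depth, [p[0][0] for p in picked])
```

With real data, the two iterables would come from open file handles on the R1 and R2 FASTQ files, and `n` would be set to the target depth (for example, 500000 for the 0.5 M level). Fixing the seed keeps a given subsample reproducible across runs.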