Génomique Métabolique, Genoscope, Institut François Jacob, CEA, CNRS, Université Paris-Saclay, 2 rue Gaston Crémieux, 91057 Evry, France.
Laboratoire Sciences des Données et de la Décision, LIST, CEA, Bâtiment 565, 91191 Gif-sur-Yvette, France.
GigaScience. 2020 Apr 1;9(4). doi: 10.1093/gigascience/giaa028.
Sequence-binning techniques enable the recovery of an increasing number of genomes from complex microbial metagenomes. They typically require prior metagenome assembly, however, and so incur its computational cost and drawbacks, e.g., biases against low-abundance genomes and the inability to conveniently assemble multi-terabyte datasets.
We present here a scalable pre-assembly binning scheme (i.e., one operating on unassembled short reads) that enables latent genome recovery by leveraging sparse dictionary learning and elastic-net regularization. We used it to recover hundreds of metagenome-assembled genomes, including very low-abundance genomes, from a joint analysis of microbiomes from the LifeLines DEEP population cohort (n = 1,135, >10^10 reads).
We showed that sparse coding techniques can be leveraged to carry out read-level binning at large scale and that, despite lower genome reconstruction yields compared to assembly-based approaches, bin-first strategies can complement the more widely used assembly-first protocols by targeting distinct genome segregation profiles. Read enrichment levels across 6 orders of magnitude in relative abundance were observed, indicating that the method has the power to recover genomes consistently segregating at low levels.
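The combination of sparse dictionary learning and elastic-net regularization named above can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a small dense matrix `X` (samples × read-derived features), and the function names, the ISTA code update, the least-squares dictionary update, and the penalty values `l1` and `l2` are all hypothetical choices made for illustration only.

```python
import numpy as np

def soft_threshold(z, t):
    # Proximal operator of the L1 penalty.
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def sparse_code(X, D, l1, l2, n_iter=200):
    # Elastic-net sparse coding by ISTA: minimizes
    # 0.5*||X - D A||_F^2 + l1*||A||_1 + 0.5*l2*||A||_F^2 over A.
    A = np.zeros((D.shape[1], X.shape[1]))
    step = 1.0 / (np.linalg.norm(D, 2) ** 2 + l2)  # 1 / Lipschitz constant
    for _ in range(n_iter):
        grad = D.T @ (D @ A - X) + l2 * A
        A = soft_threshold(A - step * grad, step * l1)
    return A

def update_dictionary(X, A, eps=1e-8):
    # Least-squares dictionary update, D = X A^T (A A^T + eps I)^-1,
    # followed by rescaling every atom (column) to unit norm.
    G = A @ A.T + eps * np.eye(A.shape[0])
    D = np.linalg.solve(G, A @ X.T).T
    norms = np.maximum(np.linalg.norm(D, axis=0), 1e-12)
    return D / norms

def learn_dictionary(X, n_components, l1=0.1, l2=0.01, n_outer=20, seed=0):
    # Alternate between sparse coding and dictionary updates.
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((X.shape[0], n_components))
    D /= np.linalg.norm(D, axis=0)
    for _ in range(n_outer):
        A = sparse_code(X, D, l1, l2)
        D = update_dictionary(X, A)
    return D, sparse_code(X, D, l1, l2)

if __name__ == "__main__":
    # Synthetic data: 3 latent abundance profiles over 30 samples,
    # each of 200 features loading on only a few of them.
    rng = np.random.default_rng(1)
    D_true = rng.standard_normal((30, 3))
    A_true = rng.standard_normal((3, 200)) * (rng.random((3, 200)) < 0.3)
    X = D_true @ A_true
    D, A = learn_dictionary(X, n_components=3)
    print("relative error:", np.linalg.norm(X - D @ A) / np.linalg.norm(X))
    print("fraction of zero codes:", np.mean(A == 0))
```

In this sketch each column of `D` plays the role of a latent abundance profile across samples, and the nonzero entries in a row of `A` mark the features assigned to that profile, which is the sense in which a sparse code can act as a bin. A dataset at the scale described in the abstract would need sparse, out-of-core, or online variants that this toy version does not attempt.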