Department of Empirical Inference, Max Planck Institute for Intelligent Systems, Tübingen 72076, Germany.
Department of Computer Science, ETH Zürich, Zürich 8092, Switzerland.
Bioinformatics. 2020 May 1;36(10):3011-3017. doi: 10.1093/bioinformatics/btaa124.
Methodological advances in metagenome assembly are rapidly increasing the number of published metagenome assemblies. However, identifying misassemblies is challenging due to a lack of closely related reference genomes that can act as pseudo ground truth. Existing reference-free methods are no longer maintained, can make strong assumptions that may not hold across a diversity of research projects, and have not been validated on large-scale metagenome assemblies.
We present DeepMAsED, a deep learning approach for identifying misassembled contigs without the need for reference genomes. Moreover, we provide an in silico pipeline for generating large-scale, realistic metagenome assemblies for comprehensive model training and testing. The accuracy of DeepMAsED substantially exceeds the state of the art when applied to large and complex metagenome assemblies. Our model estimates a 1% contig misassembly rate in two recent large-scale metagenome assembly publications.
DeepMAsED accurately identifies misassemblies in metagenome-assembled contigs from a broad diversity of bacteria and archaea without the need for reference genomes or strong modeling assumptions. Running DeepMAsED is straightforward, as is model retraining with our dataset generation pipeline. Therefore, DeepMAsED is a flexible misassembly classifier that can be applied to a wide range of metagenome assembly projects.
DeepMAsED is available from GitHub at https://github.com/leylabmpi/DeepMAsED.
Supplementary data are available at Bioinformatics online.