MegaGTA: a sensitive and accurate metagenomic gene-targeted assembler using iterative de Bruijn graphs.

Author information

Li Dinghua, Huang Yukun, Leung Chi-Ming, Luo Ruibang, Ting Hing-Fung, Lam Tak-Wah

Affiliations

Department of Computer Science, University of Hong Kong, Pokfulam, Hong Kong.

L3 Bioinformatics Limited, Western District, Hong Kong.

Publication information

BMC Bioinformatics. 2017 Oct 16;18(Suppl 12):408. doi: 10.1186/s12859-017-1825-3.

DOI:10.1186/s12859-017-1825-3

PMID:29072142

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC5657035/

Abstract

BACKGROUND

The recent release of the gene-targeted metagenomics assembler Xander has demonstrated that using a trained Hidden Markov Model (HMM) to guide the traversal of a de Bruijn graph gives an obvious advantage over other assembly methods. As a pilot study, Xander leaves a lot of room for improvement. Apart from its slow speed, Xander uses only one k-mer size for graph construction, and any choice of k compromises either sensitivity or accuracy. Xander uses a Bloom-filter representation of the de Bruijn graph to achieve a lower memory footprint. Bloom filters introduce false positives, and it is not clear how these affect the quality of the assembly. Xander also does not keep track of the multiplicity of k-mers, which would have been an effective way to differentiate between erroneous and correct k-mers.
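The point about k-mer multiplicity can be illustrated with a minimal sketch. It is not taken from Xander or MegaGTA: the function names `count_kmers` and `solid_kmers`, the toy reads, and the threshold `min_count` are all assumptions made for illustration. The idea is only that a k-mer created by a sequencing error tends to appear far less often than a k-mer from the true sequence, so a count threshold can separate the two.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every k-mer (substring of length k) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def solid_kmers(counts, min_count):
    """Keep only k-mers observed at least min_count times (hypothetical threshold)."""
    return {kmer for kmer, c in counts.items() if c >= min_count}

# Toy input: the third read carries a hypothetical substitution error (A -> T).
reads = ["ACGTAC", "ACGTAC", "ACGTTC"]
counts = count_kmers(reads, k=4)
solid = solid_kmers(counts, min_count=2)
```

Here the two k-mers that exist only because of the error (`CGTT`, `GTTC`) each have multiplicity 1 and are dropped, while the k-mers shared by the correct reads survive. An assembler that stores only k-mer presence, without counts, cannot apply this kind of filter.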

RESULTS

In this paper, we present a new gene-targeted assembler, MegaGTA, which attempts to improve on Xander in several respects. Quality-wise, it uses iterative de Bruijn graphs to take full advantage of multiple k-mer sizes and make the best of both sensitivity and accuracy. Computation-wise, it employs succinct de Bruijn graphs (SdBG) to achieve a low memory footprint and high speed (the latter benefits from a highly efficient parallel algorithm for constructing the SdBG). Unlike a Bloom filter, an SdBG is an exact representation of a de Bruijn graph. It enables MegaGTA to avoid false-positive contigs and to easily incorporate the multiplicity of k-mers for building a better HMM. We compared MegaGTA and Xander on an HMP-defined mock metagenomic dataset and showed that MegaGTA excelled in both sensitivity and accuracy. On a large rhizosphere soil metagenomic sample (327 Gbp), MegaGTA produced 9.7-19.3% more contigs than Xander, and these contigs were assigned to 10-25% more gene references. In our experiments, MegaGTA, depending on the number of k-mers used, is two to ten times faster than Xander.
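The benefit of iterating over several k-mer sizes can be sketched with a toy example. The abstract does not describe MegaGTA's procedure in detail, so the code below only shows one common way to realise the general idea, under stated assumptions: contigs assembled at a small k are carried forward as extra input when the graph is rebuilt at a larger k. All function names are hypothetical, reverse complements and sequencing errors are ignored, and the graph is a plain Python set rather than a succinct de Bruijn graph.

```python
def kmers_of(seqs, k):
    """Set of all k-mers occurring in the given sequences."""
    return {s[i:i + k] for s in seqs for i in range(len(s) - k + 1)}

def successors(kmer, kmers):
    """k-mers in the set that overlap kmer by k-1 bases on the right."""
    return [kmer[1:] + b for b in "ACGT" if kmer[1:] + b in kmers]

def predecessors(kmer, kmers):
    """k-mers in the set that overlap kmer by k-1 bases on the left."""
    return [b + kmer[:-1] for b in "ACGT" if b + kmer[:-1] in kmers]

def unitigs(kmers):
    """Merge k-mers along non-branching paths (isolated cycles are skipped)."""
    contigs, used = [], set()
    for start in sorted(kmers):
        if start in used:
            continue
        preds = predecessors(start, kmers)
        # Skip interior nodes: they are reached by extending from a path start.
        if len(preds) == 1 and len(successors(preds[0], kmers)) == 1:
            continue
        contig, cur = start, start
        used.add(start)
        while True:
            succ = successors(cur, kmers)
            if (len(succ) != 1 or succ[0] in used
                    or len(predecessors(succ[0], kmers)) != 1):
                break
            cur = succ[0]
            used.add(cur)
            contig += cur[-1]
        contigs.append(contig)
    return contigs

def iterative_assembly(reads, k_list):
    """Rebuild the graph for each k, adding contigs from the previous round."""
    contigs = []
    for k in k_list:
        kmers = kmers_of(reads, k) | kmers_of(contigs, k)
        contigs = unitigs(kmers)
    return contigs

# Two toy reads from "ATGGCGTACCTGA" that overlap by only 3 bases ("GTA").
reads = ["ATGGCGTA", "GTACCTGA"]
single_k = iterative_assembly(reads, [6])      # large k only
multi_k = iterative_assembly(reads, [4, 6])    # small k first, then large k
```

With k = 6 alone, the 3-base overlap is too short to connect the reads, so two separate contigs result. Starting at k = 4 joins them, and the joined contig supplies the missing 6-mers in the next round, so the full sequence is recovered at k = 6. The converse benefit, that a larger k separates repeats a small k would collapse, is not shown in this toy input.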

CONCLUSION

MegaGTA improves on the algorithm of Xander and achieves higher sensitivity, accuracy and speed. Moreover, it is capable of assembling gene sequences from ultra-large metagenomic datasets. Its source code is freely available at https://github.com/HKU-BAL/megagta.

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4d30/5657035/32d852088445/12859_2017_1825_Fig1_HTML.jpg

Similar articles

Xander: employing a novel method for efficient gene-targeted metagenomic assembly.

Microbiome. 2015 Aug 5;3:32. doi: 10.1186/s40168-015-0093-6. eCollection 2015.

MEGAHIT v1.0: A fast and scalable metagenome assembler driven by advanced methodologies and community practices.

Methods. 2016 Jun 1;102:3-11. doi: 10.1016/j.ymeth.2016.02.020. Epub 2016 Mar 21.

Review, Evaluation, and Directions for Gene-Targeted Assembly for Ecological Analyses of Metagenomes.

Front Genet. 2019 Oct 15;10:957. doi: 10.3389/fgene.2019.00957. eCollection 2019.

Evaluation of short read metagenomic assembly.

BMC Genomics. 2011;12 Suppl 2(Suppl 2):S8. doi: 10.1186/1471-2164-12-S2-S8. Epub 2011 Jul 27.

Inference of viral quasispecies with a paired de Bruijn graph.

Bioinformatics. 2021 May 1;37(4):473-481. doi: 10.1093/bioinformatics/btaa782.

Efficient parallel and out of core algorithms for constructing large bi-directed de Bruijn graphs.

BMC Bioinformatics. 2010 Nov 15;11:560. doi: 10.1186/1471-2105-11-560.

RResolver: efficient short-read repeat resolution within ABySS.

BMC Bioinformatics. 2022 Jun 21;23(1):246. doi: 10.1186/s12859-022-04790-z.

A space and time-efficient index for the compacted colored de Bruijn graph.

Bioinformatics. 2018 Jul 1;34(13):i169-i177. doi: 10.1093/bioinformatics/bty292.

GraphBin: refined binning of metagenomic contigs using assembly graphs.

Bioinformatics. 2020 Jun 1;36(11):3307-3313. doi: 10.1093/bioinformatics/btaa180.

Cited by

Efficient De Novo Assembly and Recovery of Microbial Genomes from Complex Metagenomes Using a Reduced Set of k-mers.

Interdiscip Sci. 2025 Jun 2. doi: 10.1007/s12539-025-00722-6.

Diversity and transcription of genes involved in respiratory As(V) reduction and As(III) methylation in Japanese paddy soils.

BMC Microbiol. 2024 Oct 9;24(1):396. doi: 10.1186/s12866-024-03562-4.

Applications of de Bruijn graphs in microbiome research.

Imeta. 2022 Mar 1;1(1):e4. doi: 10.1002/imt2.4. eCollection 2022 Mar.

kakapo: easy extraction and annotation of genes from raw RNA-seq reads.

PeerJ. 2023 Nov 27;11:e16456. doi: 10.7717/peerj.16456. eCollection 2023.

Genome-Guided Analysis of Seven Weed Species Reveals Conserved Sequence and Structural Features of Key Gene Targets for Herbicide Development.

Front Plant Sci. 2022 Jun 29;13:909073. doi: 10.3389/fpls.2022.909073. eCollection 2022.

Music of metagenomics-a review of its applications, analysis pipeline, and associated tools.

Funct Integr Genomics. 2022 Feb;22(1):3-26. doi: 10.1007/s10142-021-00810-y. Epub 2021 Oct 18.

Genome-resolved metagenomics using environmental and clinical samples.

Brief Bioinform. 2021 Sep 2;22(5). doi: 10.1093/bib/bbab030.

ContigExtender: a new approach to improving de novo sequence assembly for viral metagenomics data.

BMC Bioinformatics. 2021 Mar 12;22(1):119. doi: 10.1186/s12859-021-04038-2.

Review, Evaluation, and Directions for Gene-Targeted Assembly for Ecological Analyses of Metagenomes.

Front Genet. 2019 Oct 15;10:957. doi: 10.3389/fgene.2019.00957. eCollection 2019.

New approaches for metagenome assembly with short reads.

Brief Bioinform. 2020 Mar 23;21(2):584-594. doi: 10.1093/bib/bbz020.

References

MetaQUAST: evaluation of metagenome assemblies.

Bioinformatics. 2016 Apr 1;32(7):1088-90. doi: 10.1093/bioinformatics/btv697. Epub 2015 Nov 26.

Xander: employing a novel method for efficient gene-targeted metagenomic assembly.

Microbiome. 2015 Aug 5;3:32. doi: 10.1186/s40168-015-0093-6. eCollection 2015.

Reconstructing 16S rRNA genes in metagenomic data.

Bioinformatics. 2015 Jun 15;31(12):i35-43. doi: 10.1093/bioinformatics/btv231.

MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph.

Bioinformatics. 2015 May 15;31(10):1674-6. doi: 10.1093/bioinformatics/btv033. Epub 2015 Jan 20.

A scalable and accurate targeted gene assembly tool (SAT-Assembler) for next-generation sequencing data.

PLoS Comput Biol. 2014 Aug 14;10(8):e1003737. doi: 10.1371/journal.pcbi.1003737. eCollection 2014 Aug.

Trimmomatic: a flexible trimmer for Illumina sequence data.

Bioinformatics. 2014 Aug 1;30(15):2114-20. doi: 10.1093/bioinformatics/btu170. Epub 2014 Apr 1.

Ecological patterns of nifH genes in four terrestrial climatic zones explored with targeted metagenomics using FrameBot, a new informatics tool.

mBio. 2013 Sep 17;4(5):e00592-13. doi: 10.1128/mBio.00592-13.

Space-efficient and exact de Bruijn graph representation based on a Bloom filter.

Algorithms Mol Biol. 2013 Sep 16;8(1):22. doi: 10.1186/1748-7188-8-22.

Challenges in homology search: HMMER3 and convergent evolution of coiled-coil regions.

Nucleic Acids Res. 2013 Jul;41(12):e121. doi: 10.1093/nar/gkt263. Epub 2013 Apr 17.

SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler.

Gigascience. 2012 Dec 27;1(1):18. doi: 10.1186/2047-217X-1-18.
