ganon: precise metagenomics classification against large and up-to-date sets of reference sequences.

Affiliations

Bioinformatics Unit (MF1), Robert Koch Institute, Berlin 13353, Germany.

CAPES Foundation, Ministry of Education of Brazil, Brasília 70040-020, Brazil.

Publication information

Bioinformatics. 2020 Jul 1;36(Suppl_1):i12-i20. doi: 10.1093/bioinformatics/btaa458.

DOI: 10.1093/bioinformatics/btaa458

PMID: 32657362

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC7355301/

Abstract

MOTIVATION

The exponential growth of assembled genome sequences greatly benefits metagenomics studies. However, currently available methods struggle to manage the increasing number of sequences and their frequent updates. Indexing the current RefSeq can take days and hundreds of GB of memory on large servers. Few methods address these issues thus far, and even though many can theoretically handle large amounts of references, time/memory requirements are prohibitive in practice. As a result, many studies that require sequence classification use indices that are often outdated and almost never truly up to date.

RESULTS

Motivated by those limitations, we created ganon, a k-mer-based read classification tool that uses Interleaved Bloom Filters in conjunction with a taxonomic clustering and a k-mer counting/filtering scheme. Ganon provides an efficient method for indexing references, keeping them updated. It requires <55 min to index the complete RefSeq of bacteria, archaea, fungi and viruses. The tool can further keep these indices up-to-date in a fraction of the time necessary to create them. Ganon makes it possible to query against very large reference sets and therefore it classifies significantly more reads and identifies more species than similar methods. When classifying a high-complexity CAMI challenge dataset against complete genomes from RefSeq, ganon shows strongly increased precision with equal or better sensitivity compared with state-of-the-art tools. With the same dataset against the complete RefSeq, ganon improved the F1-score by 65% at the genus level. It supports taxonomy- and assembly-level classification, multiple indices and hierarchical classification.
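To make the combination of an Interleaved Bloom Filter with k-mer counting and filtering more concrete, here is a minimal Python sketch. It is not ganon's implementation: the class and function names, the use of BLAKE2b as the hash, one bin per reference (standing in for the taxonomic clustering), and the `min_fraction` threshold (standing in for the counting/filtering scheme) are all assumptions chosen for illustration.

```python
import hashlib


class InterleavedBloomFilter:
    """Toy IBF: one Bloom filter per bin, stored row-wise (hypothetical design)."""

    def __init__(self, num_bins, bits_per_bin, num_hashes=3):
        self.num_bins = num_bins
        self.bits_per_bin = bits_per_bin
        self.num_hashes = num_hashes
        # rows[i] is an integer whose bit b is bit i of bin b's Bloom filter,
        # so one lookup per hash position answers the query for all bins at once.
        self.rows = [0] * bits_per_bin

    def _positions(self, kmer):
        # Assumption: salted BLAKE2b stands in for independent hash functions.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                kmer.encode(), digest_size=8, salt=bytes([seed])
            ).digest()
            yield int.from_bytes(digest, "big") % self.bits_per_bin

    def insert(self, kmer, bin_id):
        for pos in self._positions(kmer):
            self.rows[pos] |= 1 << bin_id

    def query(self, kmer):
        """Return a bitmask of bins that may contain the k-mer."""
        mask = (1 << self.num_bins) - 1
        for pos in self._positions(kmer):
            mask &= self.rows[pos]
        return mask


def kmers(seq, k):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def classify(read, ibf, k, min_fraction):
    """Count k-mer hits per bin and keep bins above a fraction threshold."""
    counts = [0] * ibf.num_bins
    read_kmers = kmers(read, k)
    for km in read_kmers:
        mask = ibf.query(km)
        for b in range(ibf.num_bins):
            if (mask >> b) & 1:
                counts[b] += 1
    threshold = min_fraction * len(read_kmers)
    return {b: c for b, c in enumerate(counts) if c > 0 and c >= threshold}


# Illustrative data only: two made-up reference sequences, one per bin.
K = 5
references = {
    0: "ACGTTGCAAGCTTAGGCTAACGGATCCA",
    1: "TTTTGGGGCCCCAAAATTTTGGGGCCCC",
}
ibf = InterleavedBloomFilter(num_bins=2, bits_per_bin=4096)
for bin_id, seq in references.items():
    for km in kmers(seq, K):
        ibf.insert(km, bin_id)

read = references[0][3:20]  # a read drawn from the first reference
hits = classify(read, ibf, K, min_fraction=0.8)
print(hits)
```

Because Bloom filters have no false negatives, bin 0 is always reported for this read, with every k-mer of the read counted. Other bins can only appear through false positives, which the counting threshold is meant to suppress.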

AVAILABILITY AND IMPLEMENTATION

The software is open-source and available at: https://gitlab.com/rki_bioinformatics/ganon.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.

Similar articles

Fast and space-efficient taxonomic classification of long reads with hierarchical interleaved XOR filters.

Genome Res. 2024 Jul 23;34(6):914-924. doi: 10.1101/gr.278623.123.

MetaCache: context-aware classification of metagenomic reads using minhashing.

Bioinformatics. 2017 Dec 1;33(23):3740-3748. doi: 10.1093/bioinformatics/btx520.

A space and time-efficient index for the compacted colored de Bruijn graph.

Bioinformatics. 2018 Jul 1;34(13):i169-i177. doi: 10.1093/bioinformatics/bty292.

Large-scale machine learning for metagenomics sequence classification.

Bioinformatics. 2016 Apr 1;32(7):1023-32. doi: 10.1093/bioinformatics/btv683. Epub 2015 Nov 20.

Centrifuge: rapid and sensitive classification of metagenomic sequences.

Genome Res. 2016 Dec;26(12):1721-1729. doi: 10.1101/gr.210641.116. Epub 2016 Oct 17.

SimkaMin: fast and resource frugal de novo comparative metagenomics.

Bioinformatics. 2020 Feb 15;36(4):1275-1276. doi: 10.1093/bioinformatics/btz685.

MetaBCC-LR: metagenomics binning by coverage and composition for long reads.

Bioinformatics. 2020 Jul 1;36(Suppl_1):i3-i11. doi: 10.1093/bioinformatics/btaa441.

DREAM-Yara: an exact read mapper for very large databases with short update time.

Bioinformatics. 2018 Sep 1;34(17):i766-i772. doi: 10.1093/bioinformatics/bty567.

Large scale microbiome profiling in the cloud.

Bioinformatics. 2019 Jul 15;35(14):i13-i22. doi: 10.1093/bioinformatics/btz356.

Cited by

Colibactin genes are highly prevalent in the developing infant gut microbiome.

medRxiv. 2025 Aug 13:2025.08.12.25333511. doi: 10.1101/2025.08.12.25333511.

ganon2: up-to-date and scalable metagenomics analysis.

NAR Genom Bioinform. 2025 Jul 17;7(3):lqaf094. doi: 10.1093/nargab/lqaf094. eCollection 2025 Sep.

CAMI Benchmarking Portal: online evaluation and ranking of metagenomic software.

Nucleic Acids Res. 2025 Jul 7;53(W1):W102-W109. doi: 10.1093/nar/gkaf369.

Alignment-free viral sequence classification at scale.

BMC Genomics. 2025 Apr 18;26(1):389. doi: 10.1186/s12864-025-11554-5.

A proposed workflow to robustly analyze bacterial transcripts in RNAseq data from extracellular vesicles.

Front Microbiol. 2025 Mar 20;16:1486661. doi: 10.3389/fmicb.2025.1486661. eCollection 2025.

Addressing the dynamic nature of reference data: a new nucleotide database for robust metagenomic classification.

mSystems. 2025 Apr 22;10(4):e0123924. doi: 10.1128/msystems.01239-24. Epub 2025 Mar 20.

MNBC: a multithreaded Minimizer-based Naïve Bayes Classifier for improved metagenomic sequence classification.

Bioinformatics. 2024 Oct 1;40(10). doi: 10.1093/bioinformatics/btae601.

Rapid species-level metagenome profiling and containment estimation with sylph.

Nat Biotechnol. 2024 Oct 8. doi: 10.1038/s41587-024-02412-y.

Sequencing-based analysis of microbiomes.

Nat Rev Genet. 2024 Dec;25(12):829-845. doi: 10.1038/s41576-024-00746-6. Epub 2024 Jun 25.

Unveiling the microbial realm with VEBA 2.0: a modular bioinformatics suite for end-to-end genome-resolved prokaryotic, (micro)eukaryotic and viral multi-omics from either short- or long-read sequencing.

Nucleic Acids Res. 2024 Aug 12;52(14):e63. doi: 10.1093/nar/gkae528.
