ganon: precise metagenomics classification against large and up-to-date sets of reference sequences.

Affiliations

Bioinformatics Unit (MF1), Robert Koch Institute, Berlin 13353, Germany.

CAPES Foundation, Ministry of Education of Brazil, Brasília 70040-020, Brazil.

Publication information

Bioinformatics. 2020 Jul 1;36(Suppl_1):i12-i20. doi: 10.1093/bioinformatics/btaa458.

Abstract

MOTIVATION

The exponential growth of assembled genome sequences greatly benefits metagenomics studies. However, currently available methods struggle to manage the growing number of sequences and their frequent updates. Indexing the current RefSeq can take days and hundreds of GB of memory on large servers. Few methods address these issues thus far, and even though many can theoretically handle large amounts of references, time/memory requirements are prohibitive in practice. As a result, many studies that require sequence classification use outdated indices that are almost never truly up to date.

RESULTS

Motivated by those limitations, we created ganon, a k-mer-based read classification tool that uses Interleaved Bloom Filters in conjunction with a taxonomic clustering and a k-mer counting/filtering scheme. Ganon provides an efficient method for indexing references and keeping them updated. It requires <55 min to index the complete RefSeq of bacteria, archaea, fungi and viruses. The tool can further keep these indices up-to-date in a fraction of the time necessary to create them. Ganon makes it possible to query against very large reference sets and therefore it classifies significantly more reads and identifies more species than similar methods. When classifying a high-complexity CAMI challenge dataset against complete genomes from RefSeq, ganon shows strongly increased precision with equal or better sensitivity compared with state-of-the-art tools. With the same dataset against the complete RefSeq, ganon improved the F1-score by 65% at the genus level. It supports taxonomy- and assembly-level classification, multiple indices and hierarchical classification.
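
To make the idea of an Interleaved Bloom Filter with k-mer counting more concrete, the following is a minimal Python sketch. It is not ganon's implementation: the class and function names, the hash function, the toy sequences and the `min_fraction` threshold are all assumptions chosen for illustration. Each reference group (bin) gets its own Bloom filter, and the filters are interleaved so that one hash position returns one bit per bin. A read is then assigned to every bin that may contain at least a chosen fraction of its k-mers.

```python
import hashlib


class InterleavedBloomFilter:
    """Toy interleaved Bloom filter (illustrative, not ganon's data structure).

    rows[p] is an integer used as a bitmask: bit b is set when bin b has
    position p set in its own Bloom filter.
    """

    def __init__(self, num_bins, bits_per_bin, num_hashes=3):
        self.num_bins = num_bins
        self.bits_per_bin = bits_per_bin
        self.num_hashes = num_hashes
        self.rows = [0] * bits_per_bin

    def _positions(self, kmer):
        # One salted hash per hash function (hash choice is an assumption).
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                kmer.encode(), digest_size=8, salt=bytes([seed])
            ).digest()
            yield int.from_bytes(digest, "big") % self.bits_per_bin

    def insert(self, kmer, bin_id):
        for pos in self._positions(kmer):
            self.rows[pos] |= 1 << bin_id

    def query(self, kmer):
        """Return a bitmask of the bins that may contain the k-mer."""
        mask = (1 << self.num_bins) - 1
        for pos in self._positions(kmer):
            mask &= self.rows[pos]
        return mask


def kmers(seq, k):
    """Yield all overlapping k-mers of a sequence."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def build_index(references, k, bits_per_bin, num_hashes=3):
    """Index a list of sequences; the list position is used as the bin id."""
    ibf = InterleavedBloomFilter(len(references), bits_per_bin, num_hashes)
    for bin_id, seq in enumerate(references):
        for kmer in kmers(seq, k):
            ibf.insert(kmer, bin_id)
    return ibf


def classify(read, ibf, k, min_fraction):
    """Return bins matching at least min_fraction of the read's k-mers."""
    counts = [0] * ibf.num_bins
    total = 0
    for kmer in kmers(read, k):
        total += 1
        hits = ibf.query(kmer)
        for b in range(ibf.num_bins):
            if (hits >> b) & 1:
                counts[b] += 1
    if total == 0:  # read shorter than k
        return []
    return [b for b, c in enumerate(counts) if c >= min_fraction * total]


# Toy usage with two made-up reference sequences (bins 0 and 1).
references = ["ACGTACGGTCAGTTCAGGCATTGCA", "TTGACCGATAGGCTAACCGTTAGCA"]
ibf = build_index(references, k=5, bits_per_bin=1024)
print(classify("TACGGTCAGTTCAGG", ibf, k=5, min_fraction=0.8))
```

Because Bloom filters have no false negatives, a read taken from a reference always matches that reference's bin; false positives are possible, which is why a counting threshold over all k-mers of the read is applied instead of trusting single k-mer hits. In this sketch the bins are plain list positions, whereas the abstract describes bins derived from a taxonomic clustering.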

AVAILABILITY AND IMPLEMENTATION

The software is open-source and available at: https://gitlab.com/rki_bioinformatics/ganon.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
