A novel data structure to support ultra-fast taxonomic classification of metagenomic sequences with k-mer signatures.

Affiliations

Department of Computer Science, University of Kentucky, Lexington, KY, USA.

Publication information

Bioinformatics. 2018 Jan 1;34(1):171-178. doi: 10.1093/bioinformatics/btx432.

PMID: 29036588

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC5870563/

Abstract

MOTIVATION

Metagenomic read classification is a critical step in the identification and quantification of microbial species sampled by high-throughput sequencing. Although many algorithms have been developed to date, they suffer significant memory and/or computational costs. Due to the growing popularity of metagenomic data in both basic science and clinical applications, as well as the increasing volume of data being generated, efficient and accurate algorithms are in high demand.

RESULTS

We introduce MetaOthello, a probabilistic hashing classifier for metagenomic sequencing reads. The algorithm employs a novel data structure, called l-Othello, to support efficient querying of a taxon using its k-mer signatures. MetaOthello is an order-of-magnitude faster than the current state-of-the-art algorithms Kraken and Clark, and requires only one-third of the RAM. In comparison to Kaiju, a metagenomic classification tool using protein sequences instead of genomic sequences, MetaOthello is three times faster and exhibits 20-30% higher classification sensitivity. We report comparative analyses of both scalability and accuracy using a number of simulated and empirical datasets.
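As a rough illustration of classifying a read by its k-mer signatures, the sketch below maps each taxon-specific k-mer to a taxon and assigns a read to the taxon that collects the most k-mer votes. It is only a minimal sketch under stated assumptions: a plain Python dictionary stands in for the l-Othello structure, which is not reproduced here, and the function names, the toy reference sequences, and the simple majority vote are all hypothetical rather than taken from MetaOthello.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every length-k substring of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(references, k):
    """Map each k-mer to the one taxon that contains it.

    A k-mer found in more than one taxon is stored as None, so it
    carries no signal. This dict is a stand-in for l-Othello, not
    an implementation of it.
    """
    index = {}
    for taxon, seq in references.items():
        for km in set(kmers(seq, k)):
            if km in index and index[km] != taxon:
                index[km] = None  # shared between taxa: ambiguous
            else:
                index[km] = taxon
    return index


def classify(read, index, k):
    """Return the taxon with the most k-mer hits, or None if no hits."""
    votes = Counter(
        index[km] for km in kmers(read, k) if index.get(km) is not None
    )
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Toy reference sequences (hypothetical, for illustration only).
references = {"taxon_A": "ACGTACGTGG", "taxon_B": "TTGCATTGCA"}
index = build_index(references, k=4)
```

With this toy index, `classify("ACGTACG", index, 4)` votes only for `taxon_A`, while a read sharing no k-mers with either reference is left unclassified.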

AVAILABILITY AND IMPLEMENTATION

MetaOthello is a stand-alone program implemented in C++. The current version (1.0) is accessible via https://doi.org/10.5281/zenodo.808941.

CONTACT

liuj@cs.uky.edu.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.

Similar articles

A novel data structure to support ultra-fast taxonomic classification of metagenomic sequences with k-mer signatures.

Bioinformatics. 2018 Jan 1;34(1):171-178. doi: 10.1093/bioinformatics/btx432.

MetaShot: an accurate workflow for taxon classification of host-associated microbiome from shotgun metagenomic data.

Bioinformatics. 2017 Jun 1;33(11):1730-1732. doi: 10.1093/bioinformatics/btx036.

MetaCache: context-aware classification of metagenomic reads using minhashing.

Bioinformatics. 2017 Dec 1;33(23):3740-3748. doi: 10.1093/bioinformatics/btx520.

Fast and space-efficient taxonomic classification of long reads with hierarchical interleaved XOR filters.

Genome Res. 2024 Jul 23;34(6):914-924. doi: 10.1101/gr.278623.123.

COCACOLA: binning metagenomic contigs using sequence COmposition, read CoverAge, CO-alignment and paired-end read LinkAge.

Bioinformatics. 2017 Mar 15;33(6):791-798. doi: 10.1093/bioinformatics/btw290.

Improving the sensitivity of long read overlap detection using grouped short k-mer matches.

BMC Genomics. 2019 Apr 4;20(Suppl 2):190. doi: 10.1186/s12864-019-5475-x.

Scalable metagenomics alignment research tool (SMART): a scalable, rapid, and complete search heuristic for the classification of metagenomic sequences from complex sequence populations.

BMC Bioinformatics. 2016 Jul 28;17:292. doi: 10.1186/s12859-016-1159-6.

RIEMS: a software pipeline for sensitive and comprehensive taxonomic classification of reads from metagenomics datasets.

BMC Bioinformatics. 2015 Mar 3;16(1):69. doi: 10.1186/s12859-015-0503-6.

AFITbin: a metagenomic contig binning method using aggregate l-mer frequency based on initial and terminal nucleotides.

BMC Bioinformatics. 2024 Jul 16;25(1):241. doi: 10.1186/s12859-024-05859-7.

MetaProb: accurate metagenomic reads binning based on probabilistic sequence signatures.

Bioinformatics. 2016 Sep 1;32(17):i567-i575. doi: 10.1093/bioinformatics/btw466.

Cited by

Application and Comparison of Machine Learning and Database-Based Methods in Taxonomic Classification of High-Throughput Sequencing Data.

Genome Biol Evol. 2024 May 2;16(5). doi: 10.1093/gbe/evae102.

Exercise and microbiome: From big data to therapy.

Comput Struct Biotechnol J. 2023 Oct 19;21:5434-5445. doi: 10.1016/j.csbj.2023.10.034. eCollection 2023.

cgMSI: pathogen detection within species from nanopore metagenomic sequencing data.

BMC Bioinformatics. 2023 Oct 12;24(1):387. doi: 10.1186/s12859-023-05512-9.

Nanopore sequencing of a monkeypox virus strain isolated from a pustular lesion in the Central African Republic.

Sci Rep. 2022 Jun 24;12(1):10768. doi: 10.1038/s41598-022-15073-1.

SeqScreen: accurate and sensitive functional screening of pathogenic sequences via ensemble learning.

Genome Biol. 2022 Jun 20;23(1):133. doi: 10.1186/s13059-022-02695-x.

Fast and accurate metagenotyping of the human gut microbiome with GT-Pro.

Nat Biotechnol. 2022 Apr;40(4):507-516. doi: 10.1038/s41587-021-01102-3. Epub 2021 Dec 23.

Application of Deep Learning in Plant-Microbiota Association Analysis.

Front Genet. 2021 Oct 8;12:697090. doi: 10.3389/fgene.2021.697090. eCollection 2021.

Orchestrating an Optimized Next-Generation Sequencing-Based Cloud Workflow for Robust Viral Identification during Pandemics.

Biology (Basel). 2021 Oct 11;10(10):1023. doi: 10.3390/biology10101023.

Fast and Accurate Classification of Meta-Genomics Long Reads With deSAMBA.

Front Cell Dev Biol. 2021 Apr 28;9:643645. doi: 10.3389/fcell.2021.643645. eCollection 2021.

Specific Microbial Taxa and Functional Capacity Contribute to Chicken Abdominal Fat Deposition.

Front Microbiol. 2021 Mar 17;12:643025. doi: 10.3389/fmicb.2021.643025. eCollection 2021.

References

Centrifuge: rapid and sensitive classification of metagenomic sequences.

Genome Res. 2016 Dec;26(12):1721-1729. doi: 10.1101/gr.210641.116. Epub 2016 Oct 17.

Higher classification sensitivity of short metagenomic reads with CLARK-S.

Bioinformatics. 2016 Dec 15;32(24):3823-3825. doi: 10.1093/bioinformatics/btw542. Epub 2016 Aug 18.

Fast and sensitive taxonomic classification for metagenomics with Kaiju.

Nat Commun. 2016 Apr 13;7:11257. doi: 10.1038/ncomms11257.

An evaluation of the accuracy and speed of metagenome analysis tools.

Sci Rep. 2016 Jan 18;6:19233. doi: 10.1038/srep19233.

CLARK: fast and accurate classification of metagenomic and genomic sequences using discriminative k-mers.

BMC Genomics. 2015 Mar 25;16(1):236. doi: 10.1186/s12864-015-1419-2.

Accurate read-based metagenome characterization using a hierarchical suite of unique signatures.

Nucleic Acids Res. 2015 May 26;43(10):e69. doi: 10.1093/nar/gkv180. Epub 2015 Mar 12.

Taxator-tk: precise taxonomic assignment of metagenomes by fast approximation of evolutionary neighborhoods.

Bioinformatics. 2015 Mar 15;31(6):817-24. doi: 10.1093/bioinformatics/btu745. Epub 2014 Nov 10.

Kraken: ultrafast metagenomic sequence classification using exact alignments.

Genome Biol. 2014 Mar 3;15(3):R46. doi: 10.1186/gb-2014-15-3-r46.

Strain/species identification in metagenomes using genome-specific markers.

Nucleic Acids Res. 2014 Apr;42(8):e67. doi: 10.1093/nar/gku138. Epub 2014 Feb 12.

Metagenomic species profiling using universal phylogenetic marker genes.

Nat Methods. 2013 Dec;10(12):1196-9. doi: 10.1038/nmeth.2693. Epub 2013 Oct 20.
