Higher-order Markov models for metagenomic sequence classification.

Author information

Department of Biological Sciences and BioDiscovery Institute.

Department of Mathematics, University of North Texas, Denton, TX 76203, USA.

Publication information

Bioinformatics. 2020 Aug 15;36(14):4130-4136. doi: 10.1093/bioinformatics/btaa562.

DOI:10.1093/bioinformatics/btaa562

PMID:32516355

Abstract

MOTIVATION

Alignment-free, stochastic models derived from k-mer distributions representing reference genome sequences have a rich history in the classification of DNA sequences. In particular, variants of Markov models have been used extensively. Higher-order Markov models have been used with caution, perhaps sparingly, primarily because of a lack of sufficient training data and computational power. Advances in sequencing technology and computation have enabled exploitation of the predictive power of higher-order models. We therefore revisited higher-order Markov models and assessed their performance in classifying metagenomic sequences.

RESULTS

Comparative assessment of higher-order models (HOMs, 9th order or higher) with the interpolated Markov model, the interpolated context model and lower-order models (8th order or lower) was performed on metagenomic datasets constructed using sequenced prokaryotic genomes. Our results show that HOMs outperform other models in classifying metagenomic fragments as short as 100 nt at all taxonomic ranks, and at lower ranks when the fragment size was increased to 250 nt. HOMs were also found to be significantly more accurate than local alignment, which is widely relied upon for taxonomic classification of metagenomic sequences. A novel software implementation written in C++ performs classification faster than the existing Markovian metagenomic classifiers and can therefore be used as a standalone classifier or in conjunction with existing taxonomic classifiers for more robust classification of metagenomic sequences.
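For readers unfamiliar with the approach, the following is a minimal Python sketch of classification with a fixed-order Markov model. It is an illustration under stated assumptions, not the authors' C++ implementation: the function names (`train_markov`, `log_likelihood`, `classify`), the add-one pseudocount, and the toy reference sequences are all hypothetical. The score ignores the probability of the first `order` bases of the fragment. The order is left as a parameter; the abstract's HOMs correspond to `order >= 9`.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"

def train_markov(genome, order):
    """Count, for each context of length `order`, how often each base follows it."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(len(genome) - order):
        context = genome[i:i + order]
        nxt = genome[i + order]
        # Skip windows containing ambiguous bases such as N.
        if nxt in ALPHABET and all(c in ALPHABET for c in context):
            counts[context][nxt] += 1
    return counts

def log_likelihood(fragment, counts, order, pseudocount=1.0):
    """Sum of log transition probabilities along the fragment.

    Add-one smoothing (an assumption of this sketch) keeps unseen
    transitions from producing log(0).
    """
    total = 0.0
    for i in range(len(fragment) - order):
        context = fragment[i:i + order]
        nxt = fragment[i + order]
        row = counts.get(context, {})
        numerator = row.get(nxt, 0) + pseudocount
        denominator = sum(row.values()) + pseudocount * len(ALPHABET)
        total += math.log(numerator / denominator)
    return total

def classify(fragment, models, order):
    """Return the name of the model under which the fragment is most likely."""
    return max(models, key=lambda name: log_likelihood(fragment, models[name], order))

# Toy usage with made-up reference sequences and a low order for brevity.
ORDER = 2
models = {
    "taxon_a": train_markov("ACGT" * 50, ORDER),
    "taxon_b": train_markov("AATT" * 50, ORDER),
}
print(classify("ACGTACGTACGT", models, ORDER))
```

With real genomes, the number of possible contexts grows as 4 to the power of the order, which is why higher orders demand more training data and memory, as the motivation above notes.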

AVAILABILITY AND IMPLEMENTATION

The software has been made available at https://github.com/djburks/SMM.

CONTACT

Rajeev.Azad@unt.edu.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
