Exploiting topic modeling to boost metagenomic reads binning.

Author information

Zhang Ruichang, Cheng Zhanzhan, Guan Jihong, Zhou Shuigeng

Publication information

BMC Bioinformatics. 2015;16 Suppl 5(Suppl 5):S2. doi: 10.1186/1471-2105-16-S5-S2. Epub 2015 Mar 18.

DOI:10.1186/1471-2105-16-S5-S2

PMID:25859745

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC4402587/

Abstract

BACKGROUND

With the rapid development of high-throughput technologies, researchers can sequence the whole metagenome of a microbial community sampled directly from the environment. The assignment of these metagenomic reads into different species or taxonomical classes is a vital step for metagenomic analysis, which is referred to as binning of metagenomic data.

RESULTS

In this paper, we propose a new method, TM-MCluster, for binning metagenomic reads. First, we represent each metagenomic read as a set of "k-mers" together with their frequencies in the read. Then, we apply a probabilistic topic model, Latent Dirichlet Allocation (LDA), to the reads; it generates a number of hidden "topics" so that each read can be represented by a distribution vector over those topics. Finally, as in the MCluster method, we apply SKWIC, a variant of the classical K-means algorithm with an automatic feature-weighting mechanism, to cluster the reads represented by their topic distributions.
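The first two steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `kmer_counts` and `lda_gibbs`, the choice of k = 4, the hyperparameters `alpha` and `beta`, and the toy reads are all assumptions made for illustration, and the LDA inference shown is a plain collapsed Gibbs sampler, which may differ from the inference procedure used in the paper. The final SKWIC clustering step is not shown; the topic vectors returned here would be its input.

```python
import random
from collections import Counter


def kmer_counts(read, k=4):
    """Count overlapping k-mers in one read, skipping windows with non-ACGT bases."""
    read = read.upper()
    counts = Counter()
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA (illustrative, assumed hyperparameters).

    docs is a list of k-mer count dictionaries, one per read.
    Returns one topic distribution (a list of n_topics probabilities) per read.
    """
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    index = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)
    # Expand each count dictionary into a flat list of k-mer ids ("tokens").
    tokens = [[index[w] for w, c in d.items() for _ in range(c)] for d in docs]

    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [[0] * n_words for _ in range(n_topics)]
    topic_total = [0] * n_topics
    assign = []
    # Random initial topic assignment for every token.
    for d, toks in enumerate(tokens):
        zs = []
        for w in toks:
            z = rng.randrange(n_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(zs)

    for _ in range(n_iter):
        for d, toks in enumerate(tokens):
            for i, w in enumerate(toks):
                # Remove the token's current assignment from the counts.
                z = assign[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    # Smoothed per-read topic proportions; each row sums to 1.
    theta = []
    for d, toks in enumerate(tokens):
        denom = len(toks) + n_topics * alpha
        theta.append([(doc_topic[d][t] + alpha) / denom for t in range(n_topics)])
    return theta


# Toy reads (made up for illustration, not data from the paper).
reads = ["ACGTACGTACGTACGT", "ACGTACGAACGTACGT",
         "GGGCCCGGGCCCGGGC", "GGCCCGGGCCCGGGCC"]
docs = [kmer_counts(r, k=4) for r in reads]
theta = lda_gibbs(docs, n_topics=2, n_iter=100)
for read, dist in zip(reads, theta):
    print(read, [round(p, 3) for p in dist])
```

Each row of `theta` is the low-dimensional representation of one read. Clustering these rows, rather than the raw k-mer frequency vectors, is the idea the abstract attributes to TM-MCluster.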

CONCLUSIONS

Experiments show that the new method TM-MCluster outperforms major existing methods, including AbundanceBin, MetaCluster 3.0/5.0 and MCluster. This result indicates that the exploitation of topic modeling can effectively improve the binning performance of metagenomic reads.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/451b/4402587/574934a705b1/1471-2105-16-S5-S2-1.jpg
