Clustering metagenomic sequences with interpolated Markov models.

Affiliation

Center for Bioinformatics and Computational Biology, Institute for Advanced Computer Studies, College Park, MD 20742, USA.

Publication information

BMC Bioinformatics. 2010 Nov 2;11:544. doi: 10.1186/1471-2105-11-544.

DOI:10.1186/1471-2105-11-544

PMID:21044341

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC3098094/

Abstract

BACKGROUND

Sequencing of environmental DNA (often called metagenomics) has shown tremendous potential to uncover the vast number of unknown microbes that cannot be cultured and sequenced by traditional methods. Because the output from metagenomic sequencing is a large set of reads of unknown origin, clustering reads together that were sequenced from the same species is a crucial analysis step. Many effective approaches to this task rely on sequenced genomes in public databases, but these genomes are a highly biased sample that is not necessarily representative of environments interesting to many metagenomics projects.

RESULTS

We present SCIMM (Sequence Clustering with Interpolated Markov Models), an unsupervised sequence clustering method. SCIMM achieves greater clustering accuracy than previous unsupervised approaches. We examine the limitations of unsupervised learning on complex datasets, and suggest PHYSCIMM, a hybrid of SCIMM and the supervised learning method Phymm, which performs better when evolutionarily close training genomes are available.
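The abstract does not spell out the clustering procedure, so the following is only a minimal sketch of how unsupervised clustering with Markov sequence models could look. It is not the SCIMM implementation. It assumes a plain fixed-order Markov chain with pseudocounts in place of an interpolated Markov model, random initial assignments, and a k-means-style loop that alternates between training one model per cluster and reassigning each read to the model that scores it highest. The names `train_markov`, `score`, and `cluster`, and every parameter, are hypothetical.

```python
import math
import random
from collections import defaultdict

ALPHABET = "ACGT"


def train_markov(seqs, order=2, pseudocount=1.0):
    """Fit a fixed-order Markov chain (a stand-in for an IMM; assumption).

    Returns a function giving log P(base | previous `order` bases).
    """
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        for i in range(order, len(s)):
            counts[s[i - order:i]][s[i]] += 1.0

    def log_prob(context, base):
        ctx = counts.get(context, {})
        total = sum(ctx.values()) + pseudocount * len(ALPHABET)
        return math.log((ctx.get(base, 0.0) + pseudocount) / total)

    return log_prob


def score(model, s, order=2):
    """Log-likelihood of sequence `s` under `model`, skipping the first `order` bases."""
    return sum(model(s[i - order:i], s[i]) for i in range(order, len(s)))


def cluster(seqs, k=2, order=2, iterations=20, seed=0):
    """Alternate between model training and read reassignment (k-means style)."""
    rng = random.Random(seed)
    labels = [rng.randrange(k) for _ in seqs]  # random start (assumption)
    for _ in range(iterations):
        # Train one model per cluster on the reads currently assigned to it.
        models = [
            train_markov([s for s, l in zip(seqs, labels) if l == c], order)
            for c in range(k)
        ]
        # Reassign each read to the cluster whose model scores it highest.
        new_labels = [
            max(range(k), key=lambda c: score(models[c], s, order)) for s in seqs
        ]
        if new_labels == labels:  # assignments stable: stop early
            break
        labels = new_labels
    return labels


if __name__ == "__main__":
    # Toy reads with different base composition, purely illustrative.
    rng = random.Random(1)
    at_rich = ["".join(rng.choice("AATTTACG") for _ in range(200)) for _ in range(10)]
    gc_rich = ["".join(rng.choice("GGCCCGAT") for _ in range(200)) for _ in range(10)]
    print(cluster(at_rich + gc_rich, k=2))
```

With a random start, a loop like this can settle in a poor local optimum or leave a cluster empty, so a practical tool would need a more careful initialization than this sketch uses.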

CONCLUSIONS

SCIMM and PHYSCIMM are highly accurate methods to cluster metagenomic sequences. SCIMM operates entirely unsupervised, making it ideal for environments containing mostly novel microbes. PHYSCIMM uses supervised learning to improve clustering in environments containing microbial strains from well-characterized genera. SCIMM and PHYSCIMM are available open source from http://www.cbcb.umd.edu/software/scimm.
