MetaConClust - Unsupervised Binning of Metagenomics Data using Consensus Clustering.

Authors

Sinha Dipro, Sharma Anu, Mishra Dwijesh Chandra, Rai Anil, Lal Shashi Bhushan, Kumar Sanjeev, Farooqi Moh Samir, Chaturvedi Krishna Kumar

Affiliations

1. Research Scholar, PG School, ICAR-IARI, New Delhi-110012, India; 2. Division of Agriculture Bioinformatics, ICAR-IASRI, New Delhi-110012, India.

Publication information

Curr Genomics. 2022 Jun 10;23(2):137-146. doi: 10.2174/1389202923666220413114659.

DOI: 10.2174/1389202923666220413114659

PMID: 36778980

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9878838/

Abstract

Binning of metagenomic reads is an active area of research, and many unsupervised machine learning techniques have been used for taxonomy-independent binning of metagenomic reads. Two things matter for deciphering the complexity of microbial genomes: finding the optimal number of clusters and building an efficient pipeline. Unsupervised clustering techniques require the optimal number of clusters to be known beforehand, and determining it is a difficult task. This paper describes a novel method, MetaConClust, which uses coverage information to group contigs and a consensus-based clustering approach to find the optimal number of clusters automatically when binning metagenomics data. The coverage of contigs in a metagenomics sample has been observed to be directly proportional to the abundance of species in the sample, and MetaConClust uses it to group the data in the first phase. In the second phase, the Partitioning Around Medoids (PAM) method generates the bins, with the initial number of clusters determined automatically through a consensus-based method. Finally, the quality of the resulting bins is assessed using the silhouette index, Rand index, recall, precision, and accuracy. MetaConClust is compared with recent methods and tools on benchmarked low-complexity simulated and real metagenomic datasets; it performs better than unsupervised methods and comparably to hybrid methods. These results suggest that consensus-based clustering is a promising approach for automatically finding the number of bins for metagenomics data.
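The second phase described above can be illustrated with a small, self-contained sketch. It is not the authors' implementation: the abstract does not detail the consensus procedure, so this example swaps in a simpler stand-in and picks the number of clusters by the best mean silhouette index. The function names (`pam`, `silhouette`, `choose_k`), the greedy BUILD initialization, and the toy two-feature points are all assumptions made for illustration.

```python
import math


def total_cost(dist, medoids):
    """Sum of distances from every point to its nearest medoid."""
    return sum(min(dist[i][m] for m in medoids) for i in range(len(dist)))


def pam(dist, k):
    """Basic PAM on a precomputed distance matrix (greedy BUILD, then SWAP)."""
    n = len(dist)
    medoids = []
    # BUILD: repeatedly add the point that lowers the total cost the most.
    for _ in range(k):
        best = min((i for i in range(n) if i not in medoids),
                   key=lambda i: total_cost(dist, medoids + [i]))
        medoids.append(best)
    # SWAP: accept any medoid/non-medoid exchange that lowers the cost,
    # and stop once a full pass finds no improvement.
    improved = True
    while improved:
        improved = False
        current = total_cost(dist, medoids)
        for pos in range(k):
            for h in range(n):
                if h in medoids:
                    continue
                trial = medoids[:pos] + [h] + medoids[pos + 1:]
                cost = total_cost(dist, trial)
                if cost < current - 1e-12:
                    medoids, current, improved = trial, cost, True
    # Label each point with the position of its nearest medoid.
    labels = [min(range(k), key=lambda j: dist[i][medoids[j]]) for i in range(n)]
    return medoids, labels


def silhouette(dist, labels):
    """Mean silhouette index; points in singleton clusters score 0."""
    clusters = {}
    for i, label in enumerate(labels):
        clusters.setdefault(label, []).append(i)
    scores = []
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)
            continue
        # Mean distance to the other members of the same cluster.
        a = sum(dist[i][j] for j in own) / (len(own) - 1)
        # Smallest mean distance to the members of any other cluster.
        b = min(sum(dist[i][j] for j in members) / len(members)
                for other, members in clusters.items() if other != label)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


def choose_k(points, k_max):
    """Assumed stand-in for the consensus step: keep the k with the best silhouette."""
    dist = [[math.dist(p, q) for q in points] for p in points]
    best = None
    for k in range(2, k_max + 1):
        _, labels = pam(dist, k)
        score = silhouette(dist, labels)
        if best is None or score > best[0]:
            best = (score, k, labels)
    return best[1], best[2]


# Hypothetical contig features (for example, two scaled coverage values).
points = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1),
          (5.0, 5.0), (5.1, 4.8), (4.9, 5.2),
          (9.0, 1.0), (9.2, 1.1), (8.9, 0.8)]
best_k, labels = choose_k(points, 5)
print(best_k, labels)
```

A consensus-based variant would typically repeat the clustering over resampled data and compare the stability of the resulting partitions across candidate values of k; the silhouette criterion is used here only because it fits in a short example and is one of the quality measures the abstract names.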

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1198/9878838/23877069c1c3/CG-23-137_F1.jpg

Similar articles

Evaluating metagenomics tools for genome binning with real metagenomic datasets and CAMI datasets.

BMC Bioinformatics. 2020 Jul 28;21(1):334. doi: 10.1186/s12859-020-03667-3.

CoMet: a workflow using contig coverage and composition for binning a metagenomic sample with high precision.

BMC Bioinformatics. 2017 Dec 28;18(Suppl 16):571. doi: 10.1186/s12859-017-1967-3.

A Deep Clustering-based Novel Approach for Binning of Metagenomics Data.

Curr Genomics. 2022 Nov 18;23(5):353-368. doi: 10.2174/1389202923666220928150100.

Binning Metagenomic Contigs Using Unsupervised Clustering and Reference Databases.

Interdiscip Sci. 2022 Dec;14(4):795-803. doi: 10.1007/s12539-022-00526-y. Epub 2022 May 31.

Accurate Binning of Metagenomic Contigs Using Composition, Coverage, and Assembly Graphs.

J Comput Biol. 2022 Dec;29(12):1357-1376. doi: 10.1089/cmb.2022.0262. Epub 2022 Nov 11.

Binning long reads in metagenomics datasets using composition and coverage information.

Algorithms Mol Biol. 2022 Jul 11;17(1):14. doi: 10.1186/s13015-022-00221-z.

Improving contig binning of metagenomic data using [Formula: see text] oligonucleotide frequency dissimilarity.

BMC Bioinformatics. 2017 Sep 20;18(1):425. doi: 10.1186/s12859-017-1835-1.

HiFine: integrating Hi-C-based and shotgun-based methods to refine binning of metagenomic contigs.

Bioinformatics. 2022 May 26;38(11):2973-2979. doi: 10.1093/bioinformatics/btac295.

BMC3C: binning metagenomic contigs using codon usage, sequence composition and read coverage.

Bioinformatics. 2018 Dec 15;34(24):4172-4179. doi: 10.1093/bioinformatics/bty519.

Cited by

A review of neural networks for metagenomic binning.

Brief Bioinform. 2025 Mar 4;26(2). doi: 10.1093/bib/bbaf065.

MethSemble-6mA: an ensemble-based 6mA prediction server and its application on promoter region of LBD gene family in Poaceae.

Front Plant Sci. 2023 Oct 9;14:1256186. doi: 10.3389/fpls.2023.1256186. eCollection 2023.

A Deep Clustering-based Novel Approach for Binning of Metagenomics Data.

Curr Genomics. 2022 Nov 18;23(5):353-368. doi: 10.2174/1389202923666220928150100.

References

MetaCon: unsupervised clustering of metagenomic contigs with probabilistic k-mers statistics and coverage.

BMC Bioinformatics. 2019 Nov 22;20(Suppl 9):367. doi: 10.1186/s12859-019-2904-4.

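CoMet: a workflow using contig coverage and composition for binning a metagenomic sample with high precision.

BMC Bioinformatics. 2017 Dec 28;18(Suppl 16):571. doi: 10.1186/s12859-017-1967-3.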

Bioinformatics strategies for taxonomy independent binning and visualization of sequences in shotgun metagenomics.

Comput Struct Biotechnol J. 2016 Dec 5;15:48-55. doi: 10.1016/j.csbj.2016.11.005. eCollection 2017.

COCACOLA: binning metagenomic contigs using sequence COmposition, read CoverAge, CO-alignment and paired-end read LinkAge.

Bioinformatics. 2017 Mar 15;33(6):791-798. doi: 10.1093/bioinformatics/btw290.

Accurate binning of metagenomic contigs via automated clustering sequences using information of genomic signatures and marker genes.

Sci Rep. 2016 Apr 12;6:24175. doi: 10.1038/srep24175.

MetaBAT, an efficient tool for accurately reconstructing single genomes from complex microbial communities.

PeerJ. 2015 Aug 27;3:e1165. doi: 10.7717/peerj.1165. eCollection 2015.

CLARK: fast and accurate classification of metagenomic and genomic sequences using discriminative k-mers.

BMC Genomics. 2015 Mar 25;16(1):236. doi: 10.1186/s12864-015-1419-2.

GroopM: an automated tool for the recovery of population genomes from related metagenomes.

PeerJ. 2014 Sep 30;2:e603. doi: 10.7717/peerj.603. eCollection 2014.

Kraken: ultrafast metagenomic sequence classification using exact alignments.

Genome Biol. 2014 Mar 3;15(3):R46. doi: 10.1186/gb-2014-15-3-r46.

MetaCluster 5.0: a two-round binning approach for metagenomic data for low-abundance species in a noisy sample.

Bioinformatics. 2012 Sep 15;28(18):i356-i362. doi: 10.1093/bioinformatics/bts397.
