Improving contig binning of metagenomic data using [Formula: see text] oligonucleotide frequency dissimilarity.

Author information

Wang Ying, Wang Kun, Lu Yang Young, Sun Fengzhu

Affiliations

Department of Automation, Xiamen University, Xiamen, Fujian 361005 China.

Molecular and Computational Biology Program, University of Southern California, Los Angeles, California, CA 90089 USA.

Publication information

BMC Bioinformatics. 2017 Sep 20;18(1):425. doi: 10.1186/s12859-017-1835-1.

DOI:10.1186/s12859-017-1835-1

PMID:28931373

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC5607646/

Abstract

BACKGROUND

Metagenomic sequencing provides deep insights into microbial communities. To investigate their taxonomic structure, binning assembled contigs into discrete clusters is critical. Many binning algorithms have been developed, but their performance is not always satisfactory, especially for complex microbial communities, which calls for further development.

RESULTS

According to previous studies, relative sequence compositions are similar across different regions of the same genome but differ between distinct genomes. Current tools generally use the normalized frequency of k-tuples directly, which represents an absolute rather than a relative sequence composition. We therefore modeled contigs using relative k-tuple composition and measured the dissimilarity between contigs using [Formula: see text]. The [Formula: see text] measure was designed to quantify the dissimilarity between two long sequences or Next-Generation Sequencing datasets using Markov models of the background genomes, and it has been effective in revealing group and gradient relationships between genomes, metagenomes and metatranscriptomes. Because many binning tools are already available, we do not try to bin contigs from scratch. Instead, we developed [Formula: see text] to adjust contigs among bins based on the output of existing binning tools for a single metagenomic sample. The tool is taxonomy-free and depends only on k-tuples. To evaluate the performance of [Formula: see text], five widely used binning tools, based either on sequence composition alone or on a hybrid of sequence composition and abundance, were used to bin six synthetic and real datasets, after which [Formula: see text] was applied to adjust the binning results. Our experiments showed that [Formula: see text] consistently achieves the best performance with tuple length k = 6 under the independent identically distributed (i.i.d.) background model. Measured by recall, precision and ARI (Adjusted Rand Index), [Formula: see text] improves the binning performance in 28 out of 30 testing experiments (6 datasets with 5 binning tools). The [Formula: see text] tool is available at https://github.com/kunWangkun/d2SBin.
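The idea of comparing contigs by relative rather than absolute k-tuple composition can be illustrated with a short sketch. It assumes that the elided symbol refers to the d2S dissimilarity suggested by the repository name d2SBin, and it uses the definition of d2S that is common in the alignment-free literature, which this abstract does not reproduce. The function names are hypothetical, only the i.i.d. background model is covered, reverse complements are ignored, and this is not the d2SBin implementation.

```python
from collections import Counter
from itertools import product
from math import sqrt

ALPHABET = "ACGT"


def kmer_counts(seq, k):
    """Count overlapping k-tuples, skipping any that contain non-ACGT symbols."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if all(base in ALPHABET for base in word):
            counts[word] += 1
    return counts


def centered_counts(seq, k):
    """Observed k-tuple counts minus their expectation under an i.i.d. model.

    The i.i.d. model is estimated from the single-base frequencies of the
    sequence itself, so the result reflects relative composition.
    """
    counts = kmer_counts(seq, k)
    total = sum(counts.values())
    bases = Counter(b for b in seq.upper() if b in ALPHABET)
    n_bases = sum(bases.values())
    if n_bases == 0:
        raise ValueError("sequence contains no A, C, G or T")
    freq = {b: bases[b] / n_bases for b in ALPHABET}
    centered = {}
    for letters in product(ALPHABET, repeat=k):
        word = "".join(letters)
        prob = 1.0
        for b in word:
            prob *= freq[b]
        centered[word] = counts[word] - total * prob
    return centered


def d2s(seq_x, seq_y, k=6):
    """d2S-style dissimilarity in [0, 1]; smaller means more similar."""
    x = centered_counts(seq_x, k)
    y = centered_counts(seq_y, k)
    num = norm_x = norm_y = 0.0
    for word in x:
        denom = sqrt(x[word] ** 2 + y[word] ** 2)
        if denom == 0.0:
            continue  # this k-tuple carries no signal in either sequence
        num += x[word] * y[word] / denom
        norm_x += x[word] ** 2 / denom
        norm_y += y[word] ** 2 / denom
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.5  # arbitrary convention of this sketch when no signal exists
    return 0.5 * (1.0 - num / (sqrt(norm_x) * sqrt(norm_y)))
```

Subtracting the expected count from each observed count is what makes the composition relative: two contigs with different base frequencies but the same over- and under-represented k-tuples still come out as similar.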

CONCLUSIONS

Experiments showed that [Formula: see text] accurately measures the dissimilarity between contigs assembled from metagenomic reads and that relative sequence composition is a more reasonable basis for binning contigs. The [Formula: see text] tool can be applied to the output of any existing contig-binning tool for single metagenomic samples to obtain better binning results.
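Whether an adjustment step yields better binning results can be checked with the recall, precision and ARI metrics named in the Results. The sketch below uses definitions that are common in the binning literature, counted per contig; the abstract does not state the exact definitions the authors used (for example, whether contigs are weighted by length), so treat these as assumptions. The function name `binning_metrics` is hypothetical.

```python
from collections import Counter
from math import comb


def binning_metrics(true_labels, bin_labels):
    """Return (precision, recall, ARI) for one binning of labeled contigs.

    precision: fraction of contigs that belong to the majority genome of
               their bin.
    recall:    fraction of contigs that sit in the bin holding most of
               their genome.
    ARI:       Adjusted Rand Index between bins and true genomes.
    """
    n = len(true_labels)
    if n != len(bin_labels):
        raise ValueError("label lists must have the same length")
    if n < 2:
        raise ValueError("need at least two contigs")

    pair = Counter(zip(bin_labels, true_labels))  # contingency table
    bins = Counter(bin_labels)
    genomes = Counter(true_labels)

    best_in_bin = {}
    best_in_genome = {}
    for (b, g), count in pair.items():
        best_in_bin[b] = max(best_in_bin.get(b, 0), count)
        best_in_genome[g] = max(best_in_genome.get(g, 0), count)
    precision = sum(best_in_bin.values()) / n
    recall = sum(best_in_genome.values()) / n

    # Adjusted Rand Index from pair counts in the contingency table.
    sum_pairs = sum(comb(c, 2) for c in pair.values())
    sum_bins = sum(comb(c, 2) for c in bins.values())
    sum_genomes = sum(comb(c, 2) for c in genomes.values())
    expected = sum_bins * sum_genomes / comb(n, 2)
    max_index = (sum_bins + sum_genomes) / 2
    if max_index == expected:
        ari = 1.0  # degenerate case: both partitions are trivial
    else:
        ari = (sum_pairs - expected) / (max_index - expected)
    return precision, recall, ari
```

Computing the three values before and after adjustment on a dataset with known genome labels gives a direct comparison of the kind the abstract reports.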

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/437e/5607646/aae0509781db/12859_2017_1835_Fig1_HTML.jpg
