Unsupervised binning of environmental genomic fragments based on an error robust selection of l-mers.

Affiliation

State Key Laboratory of Bioelectronics, School of Biological Science & Medical Engineering, Southeast University, Nanjing, Jiangsu, 210096 PR China.

Publication information

BMC Bioinformatics. 2010 Apr 16;11 Suppl 2(Suppl 2):S5. doi: 10.1186/1471-2105-11-S2-S5.

DOI:10.1186/1471-2105-11-S2-S5

PMID:20406503

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3165929/

Abstract

BACKGROUND

With the rapid development of genome sequencing techniques, traditional research methods based on the isolation and cultivation of microorganisms are gradually being replaced by metagenomics, also known as environmental genomics. The first step of metagenomics, which is still a major bottleneck, is the taxonomic characterization of the DNA fragments (reads) obtained by sequencing a sample of mixed species. This step is usually referred to as "binning". Existing binning methods are supervised or semi-supervised and rely heavily on reference genomes of known microorganisms and on phylogenetic marker genes. Because reference genomes are of limited availability and marker genes are biased and unstable, existing binning methods may not be applicable in many cases.

RESULTS

In this paper, we present an unsupervised binning method based on the distribution of a carefully selected set of l-mers (substrings of length l in DNA fragments). Our experiments show that the method can accurately bin DNA fragments of various lengths and relative species abundance ratios without using any reference or training datasets. Another feature of our method is its robustness to sequencing errors: the binning accuracy decreases by less than 1% when the sequencing error rate increases from 0% to 5%. Note that the typical sequencing error rate of existing commercial sequencing platforms is less than 2%.
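To make the idea of an l-mer distribution concrete, here is a minimal sketch in Python. It is not the authors' implementation: the choice l = 4, the merging of each l-mer with its reverse complement, and the names `canonical`, `lmer_profile`, and `l1_distance` are assumptions made for illustration. The abstract does not state which l-mers the method selects or how the resulting profiles are compared.

```python
from collections import Counter
from itertools import product

# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(lmer):
    """Return the lexicographically smaller of an l-mer and its reverse complement."""
    reverse_complement = lmer.translate(COMPLEMENT)[::-1]
    return min(lmer, reverse_complement)


def lmer_profile(read, l=4):
    """Relative frequency of every canonical l-mer in a read (l = 4 is an assumption)."""
    counts = Counter()
    for i in range(len(read) - l + 1):
        lmer = read[i:i + l]
        if set(lmer) <= set("ACGT"):  # skip windows containing N or other symbols
            counts[canonical(lmer)] += 1
    total = sum(counts.values())
    # Fixed key set so that profiles of different reads are directly comparable.
    keys = sorted({canonical("".join(p)) for p in product("ACGT", repeat=l)})
    return {k: (counts[k] / total if total else 0.0) for k in keys}


def l1_distance(p, q):
    """Sum of absolute differences between two profiles (an illustrative distance)."""
    return sum(abs(p[k] - q[k]) for k in p)


if __name__ == "__main__":
    read = "ACGTACGT"
    profile = lmer_profile(read)
    print(len(profile))      # number of canonical 4-mers
    print(profile["ACGT"])   # relative frequency of ACGT in the read
```

Profiles like these could be fed to any clustering routine to group reads without reference genomes. A single substituted base alters at most l of the overlapping windows in a read, which gives some intuition for why composition-based profiles change only slightly under a low sequencing error rate.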

CONCLUSIONS

We provide a new and effective tool that solves the metagenome binning problem without using any reference datasets or marker information from known reference genomes (species). The source code of our software tool, the reference genomes of the species used to generate the test datasets, and the corresponding test datasets are available at http://i.cs.hku.hk/~alse/MetaCluster/.
