Unsupervised binning of environmental genomic fragments based on an error robust selection of l-mers.

Affiliation

State Key Laboratory of Bioelectronics, School of Biological Science & Medical Engineering, Southeast University, Nanjing, Jiangsu, 210096 PR China.

Publication information

BMC Bioinformatics. 2010 Apr 16;11 Suppl 2(Suppl 2):S5. doi: 10.1186/1471-2105-11-S2-S5.

Abstract

BACKGROUND

With the rapid development of genome sequencing techniques, traditional research methods based on the isolation and cultivation of microorganisms are being gradually replaced by metagenomics, also known as environmental genomics. The first step of metagenomics, which remains a major bottleneck, is the taxonomic characterization of the DNA fragments (reads) obtained by sequencing a sample of mixed species. This step is usually referred to as "binning". Existing binning methods are supervised or semi-supervised and rely heavily on reference genomes of known microorganisms and on phylogenetic marker genes. Because reference genomes are of limited availability and marker genes are biased and unstable, existing binning methods may not be applicable in many cases.

RESULTS

In this paper, we present an unsupervised binning method based on the distribution of a carefully selected set of l-mers (substrings of length l in DNA fragments). Our experiments show that the method can accurately bin DNA fragments of various lengths and relative species abundance ratios without using any reference or training datasets. Another feature of our method is its robustness to errors: the binning accuracy decreases by less than 1% when the sequencing error rate increases from 0% to 5%. Note that the typical sequencing error rate of existing commercial sequencing platforms is less than 2%.
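
To make the notion of an l-mer distribution concrete, the following is a minimal Python sketch of how a read can be turned into a normalized l-mer frequency vector and compared with another read. It is an illustration only and not the authors' implementation: the function names `lmer_profile` and `manhattan`, the default `l=4`, the use of all 4^l possible l-mers (the abstract describes a carefully selected subset, whose selection rule is not given here), and the Manhattan distance are all assumptions made for this example.

```python
from collections import Counter
from itertools import product


def lmer_profile(read, l=4):
    """Return the normalized l-mer frequency vector of a DNA read.

    Hypothetical helper: counts every overlapping substring of length l
    and reports frequencies in a fixed lexicographic order over A, C, G, T.
    l-mers containing other symbols (e.g. N) are ignored.
    """
    read = read.upper()
    counts = Counter(read[i:i + l] for i in range(len(read) - l + 1))
    keys = ["".join(p) for p in product("ACGT", repeat=l)]
    total = sum(counts[k] for k in keys)
    if total == 0:
        # Read shorter than l, or no valid l-mers: return an all-zero vector.
        return [0.0] * len(keys)
    return [counts[k] / total for k in keys]


def manhattan(p, q):
    """Manhattan (L1) distance between two equal-length profiles (assumed metric)."""
    return sum(abs(a - b) for a, b in zip(p, q))


profile_a = lmer_profile("ACGTACGT", l=2)
profile_b = lmer_profile("TTTTGGGG", l=2)
distance = manhattan(profile_a, profile_b)
```

Reads whose profiles lie close together under such a distance would be candidates for the same bin; any clustering step built on top of these vectors is outside the scope of this sketch.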

CONCLUSIONS

We provide a new and effective tool for solving the metagenome binning problem without using any reference datasets or marker information from known reference genomes (species). The source code of our software tool, the reference genomes of the species used to generate the test datasets, and the corresponding test datasets are available at http://i.cs.hku.hk/~alse/MetaCluster/.
