Wang Ying, Wang Kun, Lu Yang Young, Sun Fengzhu
Department of Automation, Xiamen University, Xiamen, Fujian 361005, China.
Molecular and Computational Biology Program, University of Southern California, Los Angeles, CA 90089, USA.
BMC Bioinformatics. 2017 Sep 20;18(1):425. doi: 10.1186/s12859-017-1835-1.
Metagenomic sequencing provides deep insights into microbial communities. To investigate their taxonomic structure, binning assembled contigs into discrete clusters is critical. Many binning algorithms have been developed, but their performance is not always satisfactory, especially for complex microbial communities, which calls for further development.
According to previous studies, relative sequence compositions are similar across different regions of the same genome but differ between distinct genomes. Current tools generally use the normalized frequency of k-tuples directly, which represents an absolute, not relative, sequence composition. We therefore modeled contigs using relative k-tuple composition and measured the dissimilarity between contigs using d2S. The d2S measure was designed to quantify the dissimilarity between two long sequences or Next-Generation Sequencing datasets using Markov models of the background genomes, and it has been effective in revealing group and gradient relationships between genomes, metagenomes and metatranscriptomes. Because many binning tools are already available, we do not try to bin contigs from scratch. Instead, we developed d2SBin to adjust contigs among bins based on the output of existing binning tools for a single metagenomic sample. The tool is taxonomy-free and depends only on k-tuples. To evaluate the performance of d2SBin, five widely used binning tools, which rely either on sequence composition alone or on a hybrid of sequence composition and abundance, were selected to bin six synthetic and real datasets, after which d2SBin was applied to adjust the binning results. Our experiments showed that d2SBin consistently achieves its best performance with tuple length k = 6 under the independent identically distributed (i.i.d.) background model. As measured by recall, precision and ARI (Adjusted Rand Index), d2SBin improves the binning performance in 28 out of 30 testing experiments (6 datasets with 5 binning tools). d2SBin is available at https://github.com/kunWangkun/d2SBin.
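To make the idea of relative k-tuple composition concrete, the following is a minimal sketch of a d2S-style dissimilarity under an i.i.d. background model. It is not the d2SBin implementation. The function names (centralized_counts, d2s) are hypothetical, the background nucleotide frequencies are assumed to be estimated from each sequence itself, and only one strand is counted for brevity. Each k-tuple count is centralized by subtracting its expected count under the background model, and the centralized counts of the two sequences are then compared.

```python
from collections import Counter
from itertools import product
from math import sqrt

ALPHABET = "ACGT"


def centralized_counts(seq, k):
    """Map each k-tuple to (observed count - expected count) under an i.i.d. model.

    Assumption: the background nucleotide frequencies are estimated from
    the sequence itself, and only the given strand is counted.
    """
    seq = seq.upper()
    base_counts = Counter(c for c in seq if c in ALPHABET)
    total = sum(base_counts.values())
    if total == 0:
        raise ValueError("sequence contains no A/C/G/T characters")
    freq = {b: base_counts[b] / total for b in ALPHABET}

    # Observed k-tuple counts; windows with ambiguous bases are skipped.
    observed = Counter()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if all(c in ALPHABET for c in word):
            observed[word] += 1
    n_words = sum(observed.values())

    centered = {}
    for letters in product(ALPHABET, repeat=k):
        word = "".join(letters)
        prob = 1.0
        for c in word:
            prob *= freq[c]  # i.i.d. background: product of base frequencies
        centered[word] = observed[word] - n_words * prob
    return centered


def d2s(seq_x, seq_y, k=6):
    """d2S-style dissimilarity in [0, 1]; smaller means more similar composition."""
    x = centralized_counts(seq_x, k)
    y = centralized_counts(seq_y, k)
    num = norm_x = norm_y = 0.0
    for word in x:
        denom = sqrt(x[word] ** 2 + y[word] ** 2)
        if denom == 0.0:
            continue  # this k-tuple carries no signal in either sequence
        num += x[word] * y[word] / denom
        norm_x += x[word] ** 2 / denom
        norm_y += y[word] ** 2 / denom
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.5  # no compositional signal; treat as uninformative
    value = 0.5 * (1.0 - num / (sqrt(norm_x) * sqrt(norm_y)))
    return max(0.0, min(1.0, value))  # guard against floating-point drift
```

A call such as d2s(contig_a, contig_b, k=6) returns a value near 0 for contigs with similar relative composition and a larger value for dissimilar ones. A practical implementation would typically also count reverse-complement k-tuples and could use a higher-order Markov background instead of the i.i.d. model.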
Experiments showed that d2S accurately measures the dissimilarity between contigs of metagenomic reads and that relative sequence composition is a more reasonable basis for binning the contigs. d2SBin can be applied to the output of any existing contig-binning tool for single metagenomic samples to obtain better binning results.