A base composition analysis of natural patterns for the preprocessing of metagenome sequences.

Publication information

BMC Bioinformatics. 2013;14 Suppl 11(Suppl 11):S5. doi: 10.1186/1471-2105-14-S11-S5. Epub 2013 Nov 4.

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC3816298/

Abstract

BACKGROUND

On the premise that sequence reads and contigs often exhibit the same kinds of base usage that are observed in the sequences from which they are derived, we offer a base composition analysis tool. Our tool uses these natural patterns to determine relatedness across sequence data. We introduce spectrum sets (sets of motifs), which are permutations of bacterial restriction sites, together with a base composition analysis framework that measures their proportional content in sequence data. We suggest that this framework will increase efficiency during the pre-processing stages of metagenome sequencing and assembly projects.
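The idea of a spectrum set and its proportional content can be illustrated with a minimal Python sketch. It rests on assumptions that the abstract does not confirm: a "permutation" is taken to mean any rearrangement of the bases in one restriction site, and "proportional content" is taken to mean the fraction of sequence positions covered by at least one motif occurrence. The function names `spectrum_set` and `coverage` are hypothetical, and the EcoRI site GAATTC is used only as an example seed.

```python
from itertools import permutations


def spectrum_set(site: str) -> set:
    """Return all distinct rearrangements of the bases in one restriction site.

    Assumption: this is one plausible reading of "permutations of bacterial
    restriction sites"; the published definition may differ.
    """
    return {"".join(p) for p in permutations(site.upper())}


def coverage(sequence: str, motifs: set) -> float:
    """Fraction of positions in `sequence` covered by at least one motif hit.

    Assumption: "proportional content" is modelled as positional coverage.
    """
    seq = sequence.upper()
    if not seq:
        return 0.0
    covered = [False] * len(seq)
    # Slide one window per distinct motif length and mark every matched position.
    for k in {len(m) for m in motifs}:
        for start in range(len(seq) - k + 1):
            if seq[start:start + k] in motifs:
                for pos in range(start, start + k):
                    covered[pos] = True
    return sum(covered) / len(seq)


# Example seed: the EcoRI recognition site.
ecori_set = spectrum_set("GAATTC")
print(len(ecori_set))                    # number of distinct motifs in the set
print(coverage("CCGAATTC", ecori_set))   # share of the toy sequence covered
```

Marking positions rather than counting hits keeps overlapping motif occurrences from being counted twice, which seems closer to a proportion of the sequence than a raw hit count would be.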

RESULTS

Our method is able to differentiate organisms and their reads or contigs. The framework shows how to determine the relatedness between these reads or contigs by comparing their base composition. In particular, we show that two types of organismal sequence data are fundamentally different by analyzing their spectrum set motif proportions (coverage). By applying one of the four possible spectrum sets, which encompass all known restriction sites, we provide evidence that each set has a different ability to differentiate sequence data. Furthermore, we show that selecting a spectrum set that is relevant to one organism, but not to the others in the data set, greatly improves the performance of sequence differentiation, even when the read, contig or sequence fragment is short.
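The comparison step can be sketched in the same spirit. The example below is an illustration under stated assumptions, not the authors' procedure: each sequence is summarised by one proportion per seed site (the share of windows that are a rearrangement of that site), and a fragment is assigned to the reference whose profile is nearest in Euclidean distance. The helper names, the toy sequences, and the choice of distance are all hypothetical; the seed sites are the recognition sites of EcoRI (GAATTC), HaeIII (GGCC) and MseI (TTAA).

```python
from math import dist


def window_hit_fraction(sequence: str, site: str) -> float:
    """Share of len(site)-sized windows that are a rearrangement of `site`."""
    seq, k = sequence.upper(), len(site)
    windows = len(seq) - k + 1
    if windows <= 0:
        return 0.0
    target = sorted(site.upper())
    # A window is a rearrangement of the site when its sorted bases match.
    hits = sum(sorted(seq[i:i + k]) == target for i in range(windows))
    return hits / windows


def profile(sequence: str, sites: list) -> list:
    """One proportion per seed site; an assumed stand-in for motif coverage."""
    return [window_hit_fraction(sequence, s) for s in sites]


def closest_reference(fragment: str, references: dict, sites: list) -> str:
    """Name of the reference whose profile is nearest to the fragment's profile."""
    frag = profile(fragment, sites)
    return min(references, key=lambda name: dist(frag, profile(references[name], sites)))


# Toy data, invented for illustration only.
sites = ["GAATTC", "GGCC", "TTAA"]
references = {
    "at_rich": "AATTAATTTTAAATATTTAAAATT",
    "gc_rich": "GGCCGCGCCGGCGGCCCCGGGCGC",
}
print(closest_reference("TTAATTAAATTT", references, sites))
print(closest_reference("CGCGGCCG", references, sites))
```

With a seed site that is common in one reference and absent from the other, even a short fragment separates cleanly, which mirrors the abstract's point that choosing a spectrum set relevant to one organism improves differentiation.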

CONCLUSIONS

We demonstrate proof of concept for our method by applying it to ten trials of two or three freshly selected sequence fragments (reads and contigs) for each experiment across the six organisms of our set. Here we describe a novel and computationally effective pre-processing step for metagenome sequencing and assembly tasks. Furthermore, our base composition method has applications in phylogeny, where it can be used to infer evolutionary distances between organisms, based on the notion that related organisms often share highly conserved code.
