CLC bio, 8200 Aarhus N, Denmark.
Department of Biology, Box 118525, University of Florida, Gainesville, Florida, 32611-8525, USA.
Genes (Basel). 2010 Sep 13;1(2):263-82. doi: 10.3390/genes1020263.
This study presents a new computer program for assessing the effects of different factors and sequencing strategies on de novo sequence assembly. The program uses reads from actual sequencing studies or from simulations with a reference genome that may also be real or simulated. The simulated reads can be created with our read simulator. They can be of differing length and coverage, consist of paired reads with varying distance, and include sequencing errors such as color space miscalls to imitate SOLiD data. The simulated or real reads are mapped to their reference genome and our assembly simulator is then used to obtain optimal assemblies that are limited only by the distribution of repeats. By way of this mapping, the assembly simulator determines which contigs are theoretically possible, or conversely (and perhaps more importantly), which are not. We illustrate the application and utility of our new simulation tools with several experiments that test the effects of genome complexity (repeats), read length and coverage, word size in De Bruijn graph assembly, and alternative sequencing strategies (e.g., BAC pooling) on sequence assemblies. These experiments highlight just some of the uses of our simulators in the experimental design of sequencing projects and in the further development of assembly algorithms.
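As a rough illustration of the read-simulation step described above, the following minimal sketch samples reads of a fixed length from a reference sequence until a target coverage is reached. It is not the authors' read simulator. The function name `simulate_reads`, its parameters, and the uniform placement of reads are assumptions made for this example, and the errors are simple base-space substitutions rather than SOLiD color space miscalls. Paired reads are not modeled.

```python
import random


def simulate_reads(reference, read_length, coverage, error_rate=0.0, seed=0):
    """Sample uniformly placed reads from `reference` at a target coverage.

    Hypothetical helper for illustration only. Returns a list of
    (start, read) tuples, where `start` is the 0-based reference position.
    """
    if read_length > len(reference):
        raise ValueError("read_length exceeds reference length")
    rng = random.Random(seed)
    # Coverage c = N * L / G, so the number of reads is N = c * G / L.
    n_reads = round(coverage * len(reference) / read_length)
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(reference) - read_length + 1)
        bases = list(reference[start:start + read_length])
        for i, base in enumerate(bases):
            # Assumed error model: independent substitutions in base space.
            if rng.random() < error_rate:
                bases[i] = rng.choice([b for b in "ACGT" if b != base])
        reads.append((start, "".join(bases)))
    return reads


# Simulated reference genome: 1,000 random bases.
_rng = random.Random(1)
reference = "".join(_rng.choice("ACGT") for _ in range(1000))
reads = simulate_reads(reference, read_length=50, coverage=10)
```

Because each read keeps its true start position, the reads are already "mapped" to the reference, which is the kind of information an assembly simulator could use to decide which contigs are theoretically possible.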