Departamento de Bioquímica, Instituto de Química, Universidade de São Paulo, São Paulo, SP, Brazil.
BMC Bioinformatics. 2012 May 14;13:96. doi: 10.1186/1471-2105-13-96.
Decreasing costs of DNA sequencing have made prokaryotic draft genome sequences increasingly common. A contig scaffold is an ordering of contigs in the correct orientation. A scaffold can aid genome comparisons and guide gap closure efforts. One popular technique for obtaining contig scaffolds is to map contigs onto a reference genome. However, rearrangements between the query and reference genomes can produce incorrect scaffolds if they are not taken into account. Large-scale inversions are common rearrangement events in prokaryotic genomes. Even in draft genomes, inversions can be detected given sufficient sequencing coverage and a sufficiently close reference genome.
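To make the reference-mapping technique described above concrete, the following minimal Python sketch orders contigs by where they match the reference and keeps the matched strand as the orientation. It is an illustration only and is not the sis algorithm: the `Hit` record, the `naive_scaffold` function, and the coordinates are hypothetical, and the sketch assumes each contig has a single best match to a linear reference.

```python
from typing import List, NamedTuple, Tuple


class Hit(NamedTuple):
    """Hypothetical record for one contig's best match to the reference."""
    contig: str     # contig identifier
    ref_start: int  # leftmost reference coordinate of the match
    strand: str     # "+" for a forward-strand match, "-" for reverse


def naive_scaffold(hits: List[Hit]) -> List[Tuple[str, str]]:
    """Order contigs by reference position, keeping the matched orientation.

    This ignores rearrangements entirely, so an inversion between the
    query and reference genomes would yield an incorrect scaffold.
    """
    ordered = sorted(hits, key=lambda h: h.ref_start)
    return [(h.contig, h.strand) for h in ordered]


# Made-up example data: three contigs matched to a reference.
hits = [
    Hit("c3", 9000, "+"),
    Hit("c1", 100, "+"),
    Hit("c2", 4200, "-"),
]

scaffold = naive_scaffold(hits)
print(scaffold)
```

Because the ordering simply mirrors the reference, any inversion that separates the two genomes is silently copied into the scaffold, which is the failure mode that motivates taking inversions into account.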
We present a linear-time algorithm that generates a set of contig scaffolds for a draft genome sequence represented as contigs, given a reference genome. The algorithm is aimed at prokaryotic genomes and relies on the presence of matching sequence patterns between the query and reference genomes that can be interpreted as the result of large-scale inversions; we call these patterns inversion signatures. Our algorithm correctly generates a scaffold if at least one member of every inversion signature pair is present in the contigs and no inversion signatures have been overwritten in evolution. The algorithm can also generate scaffolds in the presence of any kind of inversion, although in this general case there is no guarantee that all scaffolds in the scaffold set will be correct. We compare the performance of sis, the program that implements the algorithm, to seven other scaffold-generating programs. The results of our tests show that sis performs better overall.
sis is a new, easy-to-use tool for generating contig scaffolds, available both as a stand-alone program and as a web server. The good performance of sis in our tests adds evidence that large-scale inversions are widespread in prokaryotic genomes.