The Delft Bioinformatics Lab, Department of Mediamatics, Delft University of Technology, Mekelweg 4, Delft.
Bioinformatics. 2012 Jun 1;28(11):1429-37. doi: 10.1093/bioinformatics/bts175. Epub 2012 Apr 6.
The increasing availability of second-generation high-throughput sequencing (HTS) technologies has sparked a growing interest in de novo genome sequencing. This in turn has fueled the need for reliable means of obtaining high-quality draft genomes from short-read sequencing data. The millions of reads usually involved in HTS experiments are first assembled into longer fragments called contigs, which are then scaffolded, i.e. ordered and oriented using additional information, to produce even longer sequences called scaffolds. Most existing scaffolders for HTS genome assemblies are not suited to using information other than paired reads. They construct scaffolds from this limited information and, when faced with the tradeoff, often prefer scaffold length over accuracy.
We present GRASS (GeneRic ASsembly Scaffolder), a novel algorithm for scaffolding second-generation sequencing assemblies that can use diverse information sources. GRASS offers a mixed-integer programming formulation of the contig scaffolding problem, which combines contig order, distance and orientation in a single optimization objective. The resulting optimization problem is solved using an expectation-maximization procedure and an unconstrained binary quadratic programming approximation of the original problem. We compared GRASS with existing HTS scaffolders using Illumina paired reads of three bacterial genomes. Our algorithm constructs a comparable number of scaffolds, but makes fewer errors. This result is further improved when additional data, in the form of related genome sequences, are used.
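To make the binary quadratic flavor of the problem concrete, the following is a minimal sketch of the orientation sub-problem only. It is not the GRASS formulation, which this abstract does not spell out. It assumes each contig has a binary orientation (0 = forward, 1 = reverse) and that each link between two contigs states whether they should share an orientation, with a confidence weight. The penalty for a violated link is quadratic in the binary variables, since x_i XOR x_j = x_i + x_j - 2·x_i·x_j. The function names, the link tuple layout and the toy data are all hypothetical, and the exhaustive search is only workable for a handful of contigs.

```python
from itertools import product


def orientation_cost(orient, links):
    """Sum the weights of links whose orientation constraint is violated.

    orient: tuple of 0/1 values, one per contig (0 = forward, 1 = reverse).
    links:  iterable of (i, j, same, weight) tuples (hypothetical layout),
            where `same` is True if contigs i and j should share orientation.
    """
    cost = 0.0
    for i, j, same, weight in links:
        agrees = orient[i] == orient[j]
        if agrees != same:
            cost += weight
    return cost


def best_orientation(n_contigs, links):
    """Exhaustively search all orientation assignments (toy sizes only).

    Contig 0 is fixed to forward, because flipping every contig at once
    leaves the cost unchanged.
    """
    best, best_cost = None, float("inf")
    for rest in product((0, 1), repeat=n_contigs - 1):
        orient = (0,) + rest
        cost = orientation_cost(orient, links)
        if cost < best_cost:
            best, best_cost = orient, cost
    return best, best_cost


# Toy example with three contigs and mutually inconsistent links.
toy_links = [
    (0, 1, True, 5.0),   # contigs 0 and 1 should share orientation
    (1, 2, False, 3.0),  # contigs 1 and 2 should be opposite
    (0, 2, True, 1.0),   # weak link that conflicts with the two above
]
print(best_orientation(3, toy_links))
```

In this toy case the three links cannot all be satisfied, so the search sacrifices the weakest one. A real scaffolder would also have to handle contig order and distance, and would need a scalable solver rather than enumeration.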
GRASS source code is freely available from http://code.google.com/p/tud-scaffolding/.
Supplementary data are available at Bioinformatics online.