Department of Mathematics and Computer Science, University of Udine, Udine 33100, Italy.
BMC Bioinformatics. 2012;13 Suppl 14(Suppl 14):S8. doi: 10.1186/1471-2105-13-S14-S8. Epub 2012 Sep 7.
Next Generation Sequencing technologies are able to provide high genome coverage at a relatively low cost. However, because reads are short (from 30 bp up to 200 bp), specific bioinformatics problems have become even more difficult to solve. De novo assembly with short reads, for example, is more complicated for at least two reasons: first, the overall amount of "noisy" data to cope with has increased and, second, as read length decreases the number of unresolvable repeats grows. Our aim is to go to the root of the problem by providing a pre-processing tool capable of producing (in silico) longer and highly accurate sequences from a collection of Next Generation Sequencing reads.
In this paper a seed-and-extend local assembler is presented. The kernel algorithm is a loop that, starting from a read used as seed, keeps extending it using heuristics whose main goal is to produce a collection of longer, error-free sequences. In particular, GapFiller carefully detects reliable overlaps and clusters similar reads in order to reconstruct the missing part between the two ends of the same insert. The tool's output has been validated in 24 experiments using both simulated and real paired-read datasets. An output sequence is declared correct when the seed's mate is found. In the experiments performed, GapFiller was able to extend a high percentage of the processed seeds and find their mates, with a false positive rate that turned out to be nearly negligible.
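The seed-and-extend loop with a mate-found check can be illustrated with a toy sketch. This is not GapFiller's implementation: the function names (`overlap_len`, `extend_seed`), the parameters (`min_overlap`, `max_len`), the one-base-at-a-time extension and the "all overlapping reads must agree" rule are assumptions chosen for illustration, and the mate is assumed to be given in the same orientation as the seed.

```python
from collections import Counter


def overlap_len(contig, read, min_overlap):
    """Longest k >= min_overlap such that the last k bases of `contig`
    equal the first k bases of `read`, leaving at least one base of
    `read` to extend with. Returns 0 if there is no such overlap."""
    max_k = min(len(contig), len(read) - 1)
    for k in range(max_k, min_overlap - 1, -1):
        if contig.endswith(read[:k]):
            return k
    return 0


def extend_seed(seed, mate, reads, min_overlap=4, max_len=1000):
    """Toy seed-and-extend loop (hypothetical, for illustration only).

    Extends `seed` to the right one base at a time. An extension is
    accepted only when every overlapping read proposes the same next
    base. Returns (contig, mate_found)."""
    contig = seed
    while len(contig) < max_len:
        if mate in contig:
            return contig, True  # mate-found check: extension validated
        votes = Counter()
        for read in reads:
            k = overlap_len(contig, read, min_overlap)
            if k:
                votes[read[k]] += 1  # base that this read would append
        if len(votes) != 1:
            break  # no overlapping read, or conflicting candidates
        contig += next(iter(votes))
    return contig, mate in contig


if __name__ == "__main__":
    # Error-free reads of length 8 taken at every offset of a toy sequence.
    genome = "ACGTTGCATGCCATAGGCTA"
    reads = [genome[i:i + 8] for i in range(len(genome) - 7)]
    contig, found = extend_seed(genome[:8], genome[-8:], reads)
    print(contig, found)
```

Stopping at the first conflict is a deliberately conservative choice in this sketch: it trades contig length for a lower risk of walking into the wrong copy of a repeat, which is the same trade-off the mate-found check is meant to verify after the fact.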
GapFiller, starting from sufficiently high short-read coverage, is able to produce high coverage of accurate longer sequences (from 300 bp up to 3500 bp). The procedure to perform safe extensions, together with the mate-found check, turned out to be a powerful criterion for guaranteeing contig correctness. GapFiller has further potential, as it could be applied in a number of different scenarios, including the post-processing validation of insertion/deletion detection pipelines, pre-processing routines on datasets for de novo assembly pipelines, or any hierarchical approach designed to assemble, analyse or validate pools of sequences.