McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA.
Bioinformatics. 2011 Nov 1;27(21):2957-63. doi: 10.1093/bioinformatics/btr507. Epub 2011 Sep 7.
Next-generation sequencing technologies generate very large numbers of short reads. Even with very deep genome coverage, short read lengths cause problems in de novo assemblies. The use of paired-end libraries with a fragment size shorter than twice the read length provides an opportunity to generate much longer reads by overlapping and merging read pairs before assembling a genome.
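To make the overlap idea concrete: two reads of length R sequenced from opposite ends of a fragment of length F < 2R share 2R - F bases, so 100-bp reads from a 180-bp fragment overlap by 20 bp. The following Python sketch illustrates merging one such pair. It is not FLASH's actual algorithm or scoring scheme; the function names, the `min_overlap` and `max_mismatch_ratio` parameters, and their default values are assumptions chosen for illustration.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq))


def merge_pair(read1, read2, min_overlap=10, max_mismatch_ratio=0.25):
    """Merge a forward read with its mate (given as sequenced, i.e. reversed).

    Tries every overlap length from longest to shortest and keeps the one
    with the lowest mismatch ratio; ties go to the longer overlap.
    Returns the merged sequence, or None if no overlap is acceptable.
    Hypothetical helper, not part of FLASH.
    """
    mate = reverse_complement(read2)
    best = None  # (mismatch_ratio, overlap_length)
    longest = min(len(read1), len(mate))
    for olen in range(longest, min_overlap - 1, -1):
        tail, head = read1[-olen:], mate[:olen]
        mismatches = sum(a != b for a, b in zip(tail, head))
        ratio = mismatches / olen
        if ratio <= max_mismatch_ratio and (best is None or ratio < best[0]):
            best = (ratio, olen)
    if best is None:
        return None
    return read1 + mate[best[1]:]


# A 30-bp fragment sequenced with 20-bp reads: overlap is 2*20 - 30 = 10 bp.
fragment = "ACGTTGCAAGGCTTACCGATGTCAGGATCC"
read1 = fragment[:20]
read2 = reverse_complement(fragment[10:])
merged = merge_pair(read1, read2)
assert merged == fragment
```

The sketch ignores base qualities and simply concatenates the non-overlapping part of the mate; a production tool would also need to resolve mismatching bases inside the overlap.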
We present FLASH, a fast computational tool to extend the length of short reads by overlapping paired-end reads from fragment libraries that are sufficiently short. We tested the correctness of the tool on one million simulated read pairs, and we then applied it as a pre-processor for genome assemblies of Illumina reads from the bacterium Staphylococcus aureus and human chromosome 14. FLASH correctly extended and merged reads >99% of the time on simulated reads with an error rate of <1%. With adequately set parameters, FLASH correctly merged reads over 90% of the time even when the reads contained up to 5% errors. When FLASH was used to extend reads prior to assembly, the resulting assemblies had substantially greater N50 lengths for both contigs and scaffolds.
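The N50 statistic used to compare assemblies above can be computed as follows. This sketch uses the common definition: the length L such that contigs (or scaffolds) of length at least L together contain at least half of the total assembled bases. The function name is an assumption, and the example lengths are made up for illustration, not taken from the paper.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first length at which the running sum reaches half is the N50.
        if running >= half_total:
            return length
    return 0  # empty input


# Illustrative lengths only: total is 290, and 80 + 70 = 150 >= 145.
assert n50([80, 70, 50, 40, 30, 20]) == 70
```

Note that this measures against the total assembly length; some evaluations instead measure against the known genome size, which gives a different value.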
The FLASH system is implemented in C and is freely available as open-source code at http://www.cbcb.umd.edu/software/flash.