Institute of Bioinformatics and Biosignal Transduction, National Cheng Kung University, Tainan, Taiwan.
PLoS One. 2013 Jul 29;8(7):e69503. doi: 10.1371/journal.pone.0069503. Print 2013.
Next-generation sequencing (NGS) offers much higher data throughput at much lower cost than the traditional Sanger method. However, NGS reads are shorter than Sanger reads, which makes de novo genome assembly very challenging. Because genome assembly is essential for all downstream biological studies, great efforts have been made to improve the completeness of genome assemblies, which requires long reads or long-distance information. To improve de novo genome assembly, we developed a computational program, ARF-PE, that increases the length of Illumina reads. ARF-PE takes Illumina paired-end (PE) reads as input and recovers the original DNA fragments from whose two ends the paired reads were obtained. On the PE data of four bacteria, ARF-PE recovered >87% of the DNA fragments, and >98% of the recovered fragments were perfect. Using Velvet, SOAPdenovo, Newbler, and CABOG, we evaluated the benefits of the recovered DNA fragments for genome assembly. For all four bacteria, the recovered DNA fragments increased assembly contiguity. For example, the N50 lengths of the P. brasiliensis contigs assembled by SOAPdenovo and Newbler increased from 80,524 bp to 166,573 bp and from 80,655 bp to 193,388 bp, respectively. ARF-PE also increased assembly accuracy in many cases. On the PE data of two fungi and a human chromosome, ARF-PE doubled or tripled the N50 length. Assembly accuracy dropped in these cases but remained >91%. In general, ARF-PE can increase both assembly contiguity and accuracy for bacterial genomes. For complex eukaryotic genomes, ARF-PE is promising because it raises assembly contiguity, although error correction will be needed in the future for ARF-PE to also increase assembly accuracy. ARF-PE is freely available at http://140.116.235.124/~tliu/arf-pe/.
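The abstract reports contiguity as N50 length without defining it. As a minimal sketch, assuming the common definition (the largest length L such that contigs of length at least L together cover at least half of the total assembly length), N50 can be computed as below. The function name `n50` and the toy contig lengths are illustrative assumptions, not data or code from the paper.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths (in bp).

    Assumes the common definition: walk the contigs from longest to
    shortest and return the length of the contig at which the running
    sum first reaches half of the total assembly length.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one contig length is required")
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        # Compare 2 * running to total to avoid fractional halves.
        if 2 * running >= total:
            return length


# Toy example (hypothetical lengths): joining the two longest contigs
# into one raises N50 while the total assembly length stays the same.
fragmented = [80, 70, 50, 40, 30, 20]
merged = [150, 50, 40, 30, 20]
print(n50(fragmented), n50(merged))
```

Because N50 depends only on the length distribution, it measures contiguity and says nothing about correctness, which is why the abstract reports assembly accuracy as a separate figure.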