Software Engineering Research Center, School of Software Engineering, Beijing Jiaotong University, Beijing, China.
Department of Botany and Plant Sciences, University of California, Riverside, CA, USA.
Bioinformatics. 2019 Oct 15;35(20):3953-3960. doi: 10.1093/bioinformatics/btz206.
Third-generation PacBio long reads have greatly facilitated sequencing projects thanks to their very long read lengths, but they contain about 15% sequencing errors and require error correction. For projects with long reads only, it is challenging both to correct reads quickly and to correct a sufficient number of read bases, i.e. to achieve high-throughput self-correction. MECAT is currently among the fastest self-correction algorithms, but its throughput is relatively low (Xiao et al., 2017).
Here, we introduce FLAS, a wrapper algorithm of MECAT, to achieve high-throughput long-read self-correction while keeping MECAT's fast speed. FLAS finds additional alignments from MECAT prealigned long reads to improve the correction throughput, and removes misalignments for accuracy. In addition, FLAS uses the corrected long-read regions to correct the uncorrected ones, further improving the throughput. In our performance tests on Escherichia coli, Saccharomyces cerevisiae, Arabidopsis thaliana and human long reads, FLAS achieves 22.0-50.6% larger throughput than MECAT. FLAS is 2-13× faster than the self-correction algorithms other than MECAT, and its throughput is also 9.8-281.8% larger than theirs. Long reads corrected by FLAS can be assembled into contigs with 13.1-29.8% larger N50 sizes than those corrected by MECAT.
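The two kinds of figures quoted above, a percentage gain over a baseline and contig N50, can be illustrated with a minimal Python sketch. It is not part of FLAS or MECAT. The function names `n50` and `relative_gain_percent` are hypothetical, and the sketch assumes the standard definition of N50: the largest contig length L such that contigs of length at least L cover at least half of the total assembled bases.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig lengths (0 if empty).

    Assumes the standard definition: walk the contigs from longest to
    shortest and report the length at which the running sum first
    reaches half of the total number of bases.
    """
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if 2 * running >= total:
            return length
    return 0


def relative_gain_percent(new: float, baseline: float) -> float:
    """Percentage by which `new` exceeds `baseline` (hypothetical helper)."""
    return 100.0 * (new - baseline) / baseline


if __name__ == "__main__":
    # Illustrative numbers only; these are not results from the paper.
    contigs = [2, 3, 4, 5, 6]
    print(n50(contigs))
    print(relative_gain_percent(150.0, 100.0))
```

Under this reading, a statement such as "22.0% larger throughput" corresponds to `relative_gain_percent(corrected_bases_new, corrected_bases_baseline)` returning 22.0, assuming throughput is measured as the total number of corrected bases.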
The FLAS software can be downloaded for free from this site: https://github.com/baoe/flas.
Supplementary data are available at Bioinformatics online.