School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, 30332, GA, USA.
Institute for Data Engineering and Science, Georgia Institute of Technology, Atlanta, 30332, GA, USA.
BMC Genomics. 2020 Dec 21;21(Suppl 6):889. doi: 10.1186/s12864-020-07227-0.
Third-generation single molecule sequencing technologies can sequence long reads, which is advancing the frontiers of genomics research. However, their high error rates hinder accurate and efficient downstream analysis. This difficulty has motivated the development of many long read error correction tools, which tackle the problem through sampling redundancy and/or leveraging accurate short reads of the same biological samples. Existing studies that assess these tools use simulated data sets, and are not sufficiently comprehensive in the range of software covered or the diversity of evaluation measures used.
In this paper, we present a categorization and review of long read error correction methods, and provide a comprehensive evaluation of the corresponding long read error correction tools. Leveraging recent real sequencing data, we establish benchmark data sets and set up evaluation criteria for a comparative assessment which includes quality of error correction as well as run-time and memory usage. We study how trimming and long read sequencing depth affect error correction in terms of length distribution and genome coverage post-correction, and the impact of error correction performance on an important application of long reads, genome assembly. We provide guidelines for practitioners for choosing among the available error correction tools and identify directions for future research.
Despite the high error rate of long reads, the state-of-the-art correction tools can achieve high correction quality. When short reads are available, the best hybrid methods outperform non-hybrid methods in terms of correction quality and computing resource usage. When choosing tools, practitioners should be careful with a few correction tools that discard reads, and should check the effect of error correction tools on downstream analysis. Our evaluation code is available as open source at https://github.com/haowenz/LRECE.