Department of Internal Medicine, University of Iowa, Iowa City, IA, 52242, USA.
Department of Biostatistics, University of Iowa, Iowa City, IA, 52242, USA.
Genome Biol. 2019 Feb 4;20(1):26. doi: 10.1186/s13059-018-1605-z.
Third-generation sequencing technologies have advanced biological research by generating reads that are substantially longer than those of second-generation sequencing technologies. However, their notoriously high error rate impedes straightforward data analysis and limits their application. A handful of error-correction methods for these error-prone long reads have been developed to date. Output data quality is critical for downstream analysis, whereas limited computing resources can restrict the utility of some computationally intensive tools. Standardized assessments of these long-read error-correction methods are lacking.
Here, we present a comparative performance assessment of ten state-of-the-art error-correction methods for long reads. We established a common set of benchmarks for performance assessment, including sensitivity, accuracy, output rate, alignment rate, output read length, run time, and memory usage, as well as the effects of error correction on two downstream applications of long reads: de novo assembly and resolving haplotype sequences.
Taking all of these metrics into account, we provide suggested guidelines for method choice based on available data size, computing resources, and individual research goals.
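To make the ratio-style benchmarks named above more concrete, the following is a minimal sketch of how sensitivity, accuracy, output rate, and alignment rate might be computed from summary counts. The abstract does not give the formulas, so the definitions used here are assumptions chosen for illustration, and the `CorrectionCounts` class, its field names, and the `summarize` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CorrectionCounts:
    """Hypothetical tallies for one error-correction run (illustrative names)."""
    input_reads: int       # reads given to the correction tool
    output_reads: int      # reads the tool emitted after correction
    aligned_reads: int     # emitted reads that align to the reference
    true_positives: int    # erroneous bases that were corrected
    false_negatives: int   # erroneous bases left uncorrected
    matched_bases: int     # aligned bases in corrected reads that match the reference
    aligned_bases: int     # all aligned bases in corrected reads


def ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def summarize(c: CorrectionCounts) -> dict:
    """Compute four ratio metrics under the assumed definitions in the comments."""
    return {
        # Assumed: fraction of erroneous bases that were corrected.
        "sensitivity": ratio(c.true_positives, c.true_positives + c.false_negatives),
        # Assumed: fraction of aligned bases in corrected reads matching the reference.
        "accuracy": ratio(c.matched_bases, c.aligned_bases),
        # Assumed: fraction of input reads that appear in the output.
        "output_rate": ratio(c.output_reads, c.input_reads),
        # Assumed: fraction of output reads that align to the reference.
        "alignment_rate": ratio(c.aligned_reads, c.output_reads),
    }


if __name__ == "__main__":
    # Made-up counts, for illustration only; not results from the study.
    example = CorrectionCounts(
        input_reads=1000, output_reads=900, aligned_reads=810,
        true_positives=80, false_negatives=20,
        matched_bases=9500, aligned_bases=10000,
    )
    for name, value in summarize(example).items():
        print(f"{name}: {value:.3f}")
```

Run time, memory usage, and output read length are measured directly rather than derived as ratios, so they are left out of this sketch.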