Heydari Mahdi, Miclotte Giles, Demeester Piet, Van de Peer Yves, Fostier Jan
Department of Information Technology, Ghent University-imec, IDLab, Ghent, B-9052, Belgium.
Bioinformatics Institute Ghent, Ghent, B-9052, Belgium.
BMC Bioinformatics. 2017 Aug 18;18(1):374. doi: 10.1186/s12859-017-1784-8.
Recently, many standalone applications have been proposed to correct sequencing errors in Illumina data. The key idea is that downstream analysis tools such as de novo genome assemblers benefit from a reduced error rate in the input data. Surprisingly, a systematic validation of this assumption using state-of-the-art assembly methods is lacking, even for recently published methods.
For twelve recent Illumina error correction tools (EC tools) we evaluated both their ability to correct sequencing errors and their ability to improve de novo genome assembly in terms of contig size and accuracy.
We confirm that most EC tools reduce the number of errors in sequencing data without introducing many new errors. However, we found that many EC tools suffer from poor performance in certain sequence contexts such as regions with low coverage or regions that contain short repeated or low-complexity sequences. Reads overlapping such regions are often ill-corrected in an inconsistent manner, leading to breakpoints in the resulting assemblies that are not present in assemblies obtained from uncorrected data. Resolving this systematic flaw in future EC tools could greatly improve the applicability of such tools.