Zhu Xiao, Leung Henry C M, Wang Rongjie, Chin Francis Y L, Yiu Siu Ming, Quan Guangri, Li Yajie, Zhang Rui, Jiang Qinghua, Liu Bo, Dong Yucui, Zhou Guohui, Wang Yadong
College of Computer Sciences and Information Engineering, Harbin Normal University, Harbin, Heilongjiang, China.
Center for Bioinformatics, School of Computer Sciences and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, China.
BMC Bioinformatics. 2015 Nov 16;16:386. doi: 10.1186/s12859-015-0818-3.
Because high-throughput sequencing reads are short, genome assembly introduces errors that can adversely affect downstream data analysis. Several tools have been developed to eliminate these errors by either 1) comparing the assembled sequences with a similar reference genome, or 2) analyzing paired-end reads aligned to the assembled sequences and identifying inconsistent features along mis-assembled sequences. However, the former approach cannot distinguish assembly errors from real structural variations between the target genome and the reference genome, while the latter approach can produce many false positives (correctly assembled sequences reported as mis-assembled).
We present misFinder, a tool that aims to identify assembly errors with high accuracy in an unbiased way and to correct them at their mis-assembled positions, improving assembly accuracy for downstream analysis. It combines information from a reference (or closely related reference) genome with paired-end reads aligned to the assembled sequence. Comparing the assembled sequence with the reference genome detects candidate regions, which include both assembly errors and correct assemblies that correspond to structural variations. The different types of assembly errors are then distinguished within these candidates by analyzing the aligned paired-end reads, using multiple features derived from coverage and the consistency of insert distance, to obtain high-confidence error calls.
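The read-based step can be illustrated with a minimal sketch. It is not misFinder's implementation: the function names, window size, thresholds and input representation (fragment spans of properly aligned pairs on one contig) are all assumptions made for illustration. It computes two per-window features of the kind described above, the number of read pairs spanning a window and the consistency of their insert distance with the library's expected value, and flags windows where either looks abnormal.

```python
from statistics import mean


def window_features(pairs, contig_len, window=100):
    """Per-window features from paired-end fragments (hypothetical helper).

    pairs: list of (left, right) 0-based half-open fragment spans, one per
    properly aligned read pair on a single contig.
    """
    feats = []
    for start in range(0, contig_len, window):
        end = min(start + window, contig_len)
        # Insert distances of the pairs whose fragment covers the whole window.
        spanning = [r - l for l, r in pairs if l <= start and r >= end]
        feats.append({
            "start": start,
            "end": end,
            "spanning_pairs": len(spanning),
            "mean_insert": mean(spanning) if spanning else None,
        })
    return feats


def flag_suspicious(feats, expected_insert, insert_sd, min_pairs=2, max_z=3.0):
    """Flag windows with too few spanning pairs or a shifted mean insert distance.

    Thresholds are illustrative assumptions, not values from the paper.
    """
    flagged = []
    for f in feats:
        n = f["spanning_pairs"]
        if n < min_pairs:
            flagged.append((f["start"], f["end"], "low_spanning_coverage"))
            continue
        # z-score of the window's mean insert distance against the library mean;
        # the standard error of a mean of n inserts is insert_sd / sqrt(n).
        z = (f["mean_insert"] - expected_insert) / (insert_sd / n ** 0.5)
        if abs(z) > max_z:
            flagged.append((f["start"], f["end"], "insert_size_shift"))
    return flagged


if __name__ == "__main__":
    # Two clusters of pairs with no pair bridging positions 100-200.
    pairs = [(0, 100), (0, 100), (200, 300), (200, 300)]
    feats = window_features(pairs, contig_len=300, window=100)
    print(flag_suspicious(feats, expected_insert=100, insert_sd=10))
```

In practice, windows near contig ends naturally have few spanning pairs, so a real tool would need to treat them separately. In the approach the abstract describes, read-based evidence of this kind is combined with the reference comparison rather than used alone.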
We tested misFinder on both simulated and real paired-end read data, and it gave accurate error calls with very few miscalls. We further compared misFinder with QUAST and REAPR. misFinder outperformed both by 1) identifying more true positive mis-assemblies with very few false positives and false negatives, and 2) distinguishing correct assemblies that correspond to structural variations from mis-assembled sequences. misFinder can be freely downloaded from https://github.com/hitbio/misFinder.