MEC: Misassembly Error Correction in contigs based on distribution of paired-end reads and statistics of GC-contents.
Author information
Wu Binbin, Li Min, Liao Xingyu, Luo Junwei, Wu Fangxiang, Pan Yi, Wang Jianxin
Publication information
IEEE/ACM Trans Comput Biol Bioinform. 2018 Oct 18. doi: 10.1109/TCBB.2018.2876855.
De novo assembly tools aim to reconstruct genomes from next-generation sequencing (NGS) data. However, these tools usually generate a large number of contigs containing many misassemblies, which are caused by repetitive regions, chimeric reads, and sequencing errors. Detecting and correcting misassemblies in contigs is appealing because it improves the accuracy of assembly results, yet it remains challenging. In this study, a novel method called MEC is proposed to identify and correct misassemblies in contigs. Using the insert size distribution of paired-end reads and a statistical analysis of GC content, MEC can identify more misassemblies accurately. We evaluate MEC with the NA50 and NGA50 metrics on four datasets, compare it with most of the available misassembly correction tools, and carry out experiments to analyze the influence of MEC on scaffolding results. The results show that MEC reduces misassemblies effectively and yields quantitative improvements in scaffolding quality. MEC is publicly available at https://github.com/bioinfomaticsCSU/MEC.
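The two signals named in the abstract, paired-end insert sizes and GC content, can be illustrated with a short Python sketch. This is not MEC's actual algorithm, which the abstract does not specify. The function names, the window parameters, and the z-score cutoff of 3.0 are all assumptions chosen for illustration, and a simple z-score test stands in for whatever statistics MEC really uses.

```python
from statistics import mean, stdev


def gc_content(seq):
    """Fraction of G/C bases in seq (case-insensitive); 0.0 for an empty string."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def windowed_gc(contig, window, step):
    """Return (start, gc_fraction) for each full window along the contig."""
    return [
        (start, gc_content(contig[start:start + window]))
        for start in range(0, len(contig) - window + 1, step)
    ]


def outlier_indices(values, z_cutoff=3.0):
    """Indices whose value lies more than z_cutoff sample standard
    deviations from the mean (hypothetical cutoff, not taken from MEC)."""
    if len(values) < 2:
        return []
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > z_cutoff]
```

For example, `outlier_indices` applied to the insert sizes of read pairs mapped to a contig would flag pairs whose span deviates strongly from the library's typical insert size, and the same function applied to the GC fractions from `windowed_gc` would flag windows with unusual composition. A real tool would have to combine such evidence far more carefully before breaking a contig.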