Software Engineering Research Center, School of Software Engineering, Beijing Jiaotong University, Beijing 100044, China.
Department of Botany and Plant Sciences, University of California, Riverside, CA 92521, USA.
Bioinformatics. 2018 Jan 1;34(1):24-32. doi: 10.1093/bioinformatics/btx524.
Contigs assembled from second-generation sequencing short reads may contain misassemblies, which complicate downstream analysis or even lead to incorrect results. Fortunately, as more species are sequenced, it becomes possible to use the reference genome of a closely related species to detect misassemblies. In addition, long reads from third-generation sequencing technology are increasingly widely used and can also help detect misassemblies.
Here, we introduce ReMILO, a reference-assisted misassembly detection algorithm that uses both short reads and PacBio SMRT long reads. ReMILO aligns the initial short reads to both the contigs and the reference genome, and then constructs a novel data structure called the red-black multipositional de Bruijn graph to detect misassemblies. ReMILO also aligns the contigs to the long reads and uses the differences between them to detect additional misassemblies. In our performance test on short read assemblies of human chromosome 14 data, ReMILO detected 41.8-77.9% of extensive misassemblies and 33.6-54.5% of local misassemblies. On hybrid short and long read assemblies of S. pastorianus data, ReMILO detected 60.6-70.9% of extensive misassemblies and 28.6-54.0% of local misassemblies.
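The reference-assisted idea above can be illustrated with a deliberately simplified sketch. It is not ReMILO's red-black multipositional de Bruijn graph, whose construction is not described in this abstract. Instead, it assumes each short read has already been aligned to both a contig and the reference, and it flags contig positions where the contig-to-reference offset changes abruptly between neighbouring reads. The names `ReadHit`, `find_breakpoints`, and the `tolerance` parameter are hypothetical and introduced only for illustration.

```python
from typing import List, NamedTuple


class ReadHit(NamedTuple):
    """One read aligned to both a contig and the reference (hypothetical record)."""
    read_id: str
    contig_pos: int  # 0-based start of the read on the contig
    ref_pos: int     # 0-based start of the same read on the reference


def find_breakpoints(hits: List[ReadHit], tolerance: int = 50) -> List[int]:
    """Return contig positions where the contig-to-reference offset jumps.

    Reads are walked in contig order. If two neighbouring reads imply
    offsets (ref_pos - contig_pos) that differ by more than `tolerance`,
    the contig position of the second read is reported as a candidate
    misassembly breakpoint. The tolerance value is an assumption.
    """
    ordered = sorted(hits, key=lambda h: h.contig_pos)
    breakpoints = []
    for prev, curr in zip(ordered, ordered[1:]):
        prev_offset = prev.ref_pos - prev.contig_pos
        curr_offset = curr.ref_pos - curr.contig_pos
        if abs(curr_offset - prev_offset) > tolerance:
            breakpoints.append(curr.contig_pos)
    return breakpoints


# Toy data: the first three reads map near reference position 1000,
# the last two map roughly 8 kb further along the reference.
hits = [
    ReadHit("r1", 0, 1000),
    ReadHit("r2", 80, 1082),
    ReadHit("r3", 160, 1161),
    ReadHit("r4", 240, 9240),
    ReadHit("r5", 320, 9318),
]
candidates = find_breakpoints(hits)
print(candidates)
```

With a closely related rather than identical reference, genuine structural differences between the two species would also produce offset jumps, so a real tool would need additional evidence before calling a misassembly. The abstract's use of long reads as a second source of evidence is consistent with that concern.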
The ReMILO software can be downloaded for free under the Artistic License 2.0 from https://github.com/songc001/remilo.
Supplementary data are available at Bioinformatics online.