misFinder: identify mis-assemblies in an unbiased manner using reference and paired-end reads.

Authors

Zhu Xiao, Leung Henry C M, Wang Rongjie, Chin Francis Y L, Yiu Siu Ming, Quan Guangri, Li Yajie, Zhang Rui, Jiang Qinghua, Liu Bo, Dong Yucui, Zhou Guohui, Wang Yadong

Affiliations

College of Computer Sciences and Information Engineering, Harbin Normal University, Harbin, Heilongjiang, China.

Center for Bioinformatics, School of Computer Sciences and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, China.

Publication

BMC Bioinformatics. 2015 Nov 16;16:386. doi: 10.1186/s12859-015-0818-3.

DOI:10.1186/s12859-015-0818-3

PMID:26573684

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC4647709/

Abstract

BACKGROUND

Because high-throughput sequencing reads are short, genome assembly introduces errors that may adversely affect downstream data analysis. Several tools have been developed to eliminate these errors by either 1) comparing the assembled sequences with a similar reference genome, or 2) analyzing paired-end reads aligned to the assembled sequences and identifying inconsistent features along mis-assembled sequences. However, the former approach cannot distinguish assembly errors from real structural variations between the target genome and the reference genome, while the latter approach can produce many false positives (correctly assembled sequences reported as mis-assembled).

RESULTS

We present misFinder, a tool that aims to identify assembly errors with high accuracy in an unbiased way and to correct them at their mis-assembled positions, improving assembly accuracy for downstream analysis. It combines information from a reference (or closely related reference) genome with paired-end reads aligned to the assembled sequence. Both assembly errors and correct assemblies corresponding to structural variations can be detected by comparing the reference genome with the assembled sequence. Different types of assembly errors are then distinguished from the mis-assembled sequence by analyzing the aligned paired-end reads with multiple features derived from coverage and the consistency of insert distance, yielding high-confidence error calls.
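The two-stage idea in this paragraph can be illustrated with a minimal sketch. It is not misFinder's actual implementation: the `RegionEvidence` fields, the `classify_candidate` function, and all thresholds are hypothetical choices made for illustration. The sketch assumes that a region where the assembly differs from the reference has already been found, and it uses only two read-based features (relative coverage and insert-size consistency) to decide whether the reads support the assembly.

```python
from dataclasses import dataclass
from math import sqrt
from statistics import mean
from typing import List


@dataclass
class RegionEvidence:
    """Paired-end evidence around one candidate region (hypothetical fields)."""
    local_coverage: float       # mean read depth in a window around the region
    global_coverage: float      # mean read depth over the whole assembly
    insert_sizes: List[float]   # insert sizes of read pairs spanning the region
    library_mean: float         # expected insert size of the library
    library_sd: float           # standard deviation of the library insert size


def classify_candidate(ev: RegionEvidence,
                       min_cov_ratio: float = 0.5,
                       max_z: float = 3.0,
                       min_pairs: int = 5) -> str:
    """Label a region that differs from the reference.

    Returns "structural_variation" when the reads support the assembly
    (adequate coverage and consistent insert sizes), else "mis_assembly".
    All thresholds are illustrative assumptions, not published values.
    """
    if ev.global_coverage <= 0 or ev.library_sd <= 0:
        raise ValueError("global_coverage and library_sd must be positive")

    # Feature 1: coverage relative to the assembly-wide average.
    cov_ratio = ev.local_coverage / ev.global_coverage
    if cov_ratio < min_cov_ratio:
        return "mis_assembly"

    # Feature 2: too few spanning pairs means the join is unsupported.
    n = len(ev.insert_sizes)
    if n < min_pairs:
        return "mis_assembly"

    # Feature 3: z-score of the mean spanning insert size against the library.
    z = abs(mean(ev.insert_sizes) - ev.library_mean) / (ev.library_sd / sqrt(n))
    if z > max_z:
        return "mis_assembly"

    return "structural_variation"


# Reads support the assembly: the difference from the reference is likely real.
supported = RegionEvidence(30, 32, [298, 305, 301, 296, 310, 300], 300, 20)
# Coverage collapses at the join: likely an assembly error.
low_cov = RegionEvidence(4, 32, [520, 540], 300, 20)
# Coverage is fine but spanning pairs are stretched: likely an assembly error.
stretched = RegionEvidence(30, 32, [520, 540, 510, 530, 525, 515], 300, 20)

for name, ev in [("supported", supported), ("low_cov", low_cov), ("stretched", stretched)]:
    print(name, classify_candidate(ev))
```

The point of the sketch is the division of labor: the reference comparison only nominates regions, and the read evidence decides between a true structural variation and a mis-assembly. A real tool would use more features and distinguish several error types.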

CONCLUSIONS

We tested misFinder on both simulated and real paired-end read data, and it gave accurate error calls with very few miscalls. We further compared misFinder with QUAST and REAPR. misFinder outperformed both by 1) identifying more true positive mis-assemblies with very few false positives and false negatives, and 2) distinguishing correct assemblies corresponding to structural variations from mis-assembled sequences. misFinder can be freely downloaded from https://github.com/hitbio/misFinder.

Similar articles

NucBreak: location of structural errors in a genome assembly by using paired-end Illumina reads.

BMC Bioinformatics. 2020 Feb 21;21(1):66. doi: 10.1186/s12859-020-3414-0.

Tigmint: correcting assembly errors using linked reads from large molecules.

BMC Bioinformatics. 2018 Oct 26;19(1):393. doi: 10.1186/s12859-018-2425-6.

Subset selection of high-depth next generation sequencing reads for de novo genome assembly using MapReduce framework.

BMC Genomics. 2015;16 Suppl 12(Suppl 12):S9. doi: 10.1186/1471-2164-16-S12-S9. Epub 2015 Dec 9.

GAPPadder: a sensitive approach for closing gaps on draft genomes with short sequence reads.

BMC Genomics. 2019 Jun 6;20(Suppl 5):426. doi: 10.1186/s12864-019-5703-4.

Improving the sensitivity of long read overlap detection using grouped short k-mer matches.

BMC Genomics. 2019 Apr 4;20(Suppl 2):190. doi: 10.1186/s12864-019-5475-x.

Identifying wrong assemblies in de novo short read primary sequence assembly contigs.

J Biosci. 2016 Sep;41(3):455-74. doi: 10.1007/s12038-016-9630-0.

Assembly of chloroplast genomes with long- and short-read data: a comparison of approaches using Eucalyptus pauciflora as a test case.

BMC Genomics. 2018 Dec 29;19(1):977. doi: 10.1186/s12864-018-5348-8.

NeatFreq: reference-free data reduction and coverage normalization for De Novo sequence assembly.

BMC Bioinformatics. 2014 Nov 19;15(1):357. doi: 10.1186/s12859-014-0357-3.

SQUAT: a Sequencing Quality Assessment Tool for data quality assessments of genome assemblies.

BMC Genomics. 2019 Apr 18;19(Suppl 9):238. doi: 10.1186/s12864-019-5445-3.

Cited by

ASVBM: Structural variant benchmarking with local joint analysis for multiple callsets.

Comput Struct Biotechnol J. 2025 Jun 29;27:2851-2862. doi: 10.1016/j.csbj.2025.06.045. eCollection 2025.

Klumpy: A tool to evaluate the integrity of long-read genome assemblies and illusive sequence motifs.

Mol Ecol Resour. 2025 Jan;25(1):e13982. doi: 10.1111/1755-0998.13982. Epub 2024 May 27.

metaMIC: reference-free misassembly identification and correction of de novo metagenomic assemblies.

Genome Biol. 2022 Nov 14;23(1):242. doi: 10.1186/s13059-022-02810-y.

Genome sequence assembly algorithms and misassembly identification methods.

Mol Biol Rep. 2022 Nov;49(11):11133-11148. doi: 10.1007/s11033-022-07919-8. Epub 2022 Sep 23.

LongStitch: high-quality genome assembly correction and scaffolding using long reads.

BMC Bioinformatics. 2021 Oct 30;22(1):534. doi: 10.1186/s12859-021-04451-7.

SkewIT: The Skew Index Test for large-scale GC Skew analysis of bacterial genomes.

PLoS Comput Biol. 2020 Dec 4;16(12):e1008439. doi: 10.1371/journal.pcbi.1008439. eCollection 2020 Dec.

PACVr: plastome assembly coverage visualization in R.

BMC Bioinformatics. 2020 May 24;21(1):207. doi: 10.1186/s12859-020-3475-0.

SKESA: strategic k-mer extension for scrupulous assemblies.

Genome Biol. 2018 Oct 4;19(1):153. doi: 10.1186/s13059-018-1540-z.

misMM: An Integrated Pipeline for Misassembly Detection Using Genotyping-by-Sequencing and Its Validation with BAC End Library Sequences and Gene Synteny.

Genomics Inform. 2017 Dec;15(4):128-135. doi: 10.5808/GI.2017.15.4.128. Epub 2017 Dec 29.

Reference genome assessment from a population scale perspective: an accurate profile of variability and noise.

Bioinformatics. 2017 Nov 15;33(22):3511-3517. doi: 10.1093/bioinformatics/btx482.

References

PERGA: a paired-end read guided de novo assembler for extending contigs using SVM and look ahead approach.

PLoS One. 2014 Dec 2;9(12):e114253. doi: 10.1371/journal.pone.0114253. eCollection 2014.

LUMPY: a probabilistic framework for structural variant discovery.

Genome Biol. 2014 Jun 26;15(6):R84. doi: 10.1186/gb-2014-15-6-r84.

The MaSuRCA genome assembler.

Bioinformatics. 2013 Nov 1;29(21):2669-77. doi: 10.1093/bioinformatics/btt476. Epub 2013 Aug 29.

REAPR: a universal tool for genome assembly evaluation.

Genome Biol. 2013 May 27;14(5):R47. doi: 10.1186/gb-2013-14-5-r47.

QUAST: quality assessment tool for genome assemblies.

Bioinformatics. 2013 Apr 15;29(8):1072-5. doi: 10.1093/bioinformatics/btt086. Epub 2013 Feb 19.

CGAL: computing genome assembly likelihoods.

Genome Biol. 2013 Jan 29;14(1):R8. doi: 10.1186/gb-2013-14-1-r8.

ALE: a generic assembly likelihood evaluation framework for assessing the accuracy of genome and metagenome assemblies.

Bioinformatics. 2013 Feb 15;29(4):435-43. doi: 10.1093/bioinformatics/bts723. Epub 2013 Jan 9.

BreakFusion: targeted assembly-based identification of gene fusions in whole transcriptome paired-end sequencing data.

Bioinformatics. 2012 Jul 15;28(14):1923-4. doi: 10.1093/bioinformatics/bts272. Epub 2012 May 4.

Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration.

Brief Bioinform. 2013 Mar;14(2):178-92. doi: 10.1093/bib/bbs017. Epub 2012 Apr 19.

IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth.

Bioinformatics. 2012 Jun 1;28(11):1420-8. doi: 10.1093/bioinformatics/bts174. Epub 2012 Apr 11.
