Innovations and challenges in detecting long read overlaps: an evaluation of the state-of-the-art.

Author information

Chu Justin, Mohamadi Hamid, Warren René L, Yang Chen, Birol Inanç

Affiliations

University of British Columbia, Vancouver, BC V6T 1Z4, Canada.

Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4S6, Canada.

Publication information

Bioinformatics. 2017 Apr 15;33(8):1261-1270. doi: 10.1093/bioinformatics/btw811.

DOI: 10.1093/bioinformatics/btw811

PMID: 28003261

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5408847/

Abstract

Identifying overlaps between error-prone long reads, specifically those from Oxford Nanopore Technologies (ONT) and Pacific Biosciences (PB), is essential for certain downstream applications, including error correction and de novo assembly. Though akin to the read-to-reference alignment problem, read-to-read overlap detection is a distinct problem that can benefit from specialized algorithms that perform efficiently and robustly on high error rate long reads. Here, we review the current state-of-the-art read-to-read overlap tools for error-prone long reads, including BLASR, DALIGNER, MHAP, GraphMap and Minimap. These specialized bioinformatics tools differ not just in their algorithmic designs and methodology, but also in their robustness of performance on a variety of datasets, time and memory efficiency and scalability. We highlight the algorithmic features of these tools, as well as their potential issues and biases when utilizing any particular method. To supplement our review of the algorithms, we benchmarked these tools, tracking their resource needs and computational performance, and assessed the specificity and precision of each. In the versions of the tools tested, we observed that Minimap is the most computationally efficient, specific and sensitive method on the ONT datasets tested; whereas GraphMap and DALIGNER are the most specific and sensitive methods on the tested PB datasets. The concepts surveyed may apply to future sequencing technologies, as scalability is becoming more relevant with increased sequencing throughput.
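As an illustration of the kind of accuracy assessment described above, the following is a minimal sketch of how reported overlaps could be scored against a known truth set. It is not the authors' benchmarking code. The function names (`normalize`, `overlap_metrics`), the representation of an overlap as an unordered pair of read IDs, and the toy data are all assumptions made for this example.

```python
def normalize(pairs):
    """Treat each overlap as an unordered pair of read IDs; drop self-overlaps."""
    return {frozenset(p) for p in pairs if p[0] != p[1]}


def overlap_metrics(reported, truth, read_ids):
    """Score reported overlaps against a truth set (hypothetical helper).

    reported, truth: iterables of (read_a, read_b) tuples.
    read_ids: all read IDs in the dataset, used to count true negatives.
    """
    reported = normalize(reported)
    truth = normalize(truth)

    n = len(set(read_ids))
    total_pairs = n * (n - 1) // 2  # every possible unordered read pair

    tp = len(reported & truth)   # overlaps found and real
    fp = len(reported - truth)   # overlaps found but not real
    fn = len(truth - reported)   # real overlaps that were missed
    tn = total_pairs - tp - fp - fn

    return {
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
    }


# Toy data (invented for illustration): four reads tiling a region.
reads = ["r1", "r2", "r3", "r4"]
truth = [("r1", "r2"), ("r2", "r3"), ("r3", "r4")]
reported = [("r2", "r1"), ("r2", "r3"), ("r1", "r4")]

metrics = overlap_metrics(reported, truth, reads)
print(metrics)
```

On real datasets non-overlapping read pairs vastly outnumber overlapping ones, so a specificity computed this way sits close to 1 for almost any tool. Precision is usually the more informative companion to sensitivity.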

CONTACT

cjustin@bcgsc.ca, ibirol@bcgsc.ca

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
