Innovations and challenges in detecting long read overlaps: an evaluation of the state-of-the-art.

Authors

Chu Justin, Mohamadi Hamid, Warren René L, Yang Chen, Birol Inanç

Affiliations

University of British Columbia, Vancouver, BC V6T 1Z4, Canada.

Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4S6, Canada.

Publication information

Bioinformatics. 2017 Apr 15;33(8):1261-1270. doi: 10.1093/bioinformatics/btw811.

Abstract

Identifying overlaps between error-prone long reads, specifically those from Oxford Nanopore Technologies (ONT) and Pacific Biosciences (PB), is essential for certain downstream applications, including error correction and de novo assembly. Though akin to the read-to-reference alignment problem, read-to-read overlap detection is a distinct problem that can benefit from specialized algorithms that perform efficiently and robustly on high error rate long reads. Here, we review the current state-of-the-art read-to-read overlap tools for error-prone long reads, including BLASR, DALIGNER, MHAP, GraphMap and Minimap. These specialized bioinformatics tools differ not just in their algorithmic designs and methodology, but also in their robustness of performance on a variety of datasets, time and memory efficiency and scalability. We highlight the algorithmic features of these tools, as well as their potential issues and biases when utilizing any particular method. To supplement our review of the algorithms, we benchmarked these tools, tracking their resource needs and computational performance, and assessed the specificity and precision of each. In the versions of the tools tested, we observed that Minimap is the most computationally efficient, specific and sensitive method on the ONT datasets tested; whereas GraphMap and DALIGNER are the most specific and sensitive methods on the tested PB datasets. The concepts surveyed may apply to future sequencing technologies, as scalability is becoming more relevant with increased sequencing throughput.
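The accuracy assessment mentioned above can be illustrated with a minimal sketch. It assumes textbook definitions of sensitivity, specificity and precision over unordered read pairs, which may differ from the exact definitions used in the paper. The function name `overlap_metrics`, the read identifiers and the toy data are hypothetical, and every predicted pair is assumed to come from the supplied read set.

```python
from itertools import combinations


def overlap_metrics(read_ids, true_pairs, predicted_pairs):
    """Return (sensitivity, specificity, precision) for predicted overlaps.

    Hypothetical helper: uses textbook definitions, not necessarily
    the ones applied in the paper's benchmark.
    """
    # Store each pair as a frozenset so (a, b) and (b, a) are the same pair.
    truth = {frozenset(p) for p in true_pairs}
    predicted = {frozenset(p) for p in predicted_pairs}
    # Every unordered pair of distinct reads is a candidate overlap.
    all_pairs = {frozenset(p) for p in combinations(read_ids, 2)}

    tp = len(predicted & truth)              # overlaps found and real
    fp = len(predicted - truth)              # overlaps reported but not real
    fn = len(truth - predicted)              # real overlaps that were missed
    tn = len(all_pairs - truth - predicted)  # non-overlaps correctly left out

    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return sensitivity, specificity, precision


# Toy data (made up for illustration): four reads, two true overlaps.
reads = ["r1", "r2", "r3", "r4"]
truth = [("r1", "r2"), ("r2", "r3")]
calls = [("r2", "r1"), ("r3", "r4")]
sens, spec, prec = overlap_metrics(reads, truth, calls)
print(sens, spec, prec)
```

Because the number of non-overlapping pairs grows quadratically with the number of reads, specificity computed this way tends to sit close to 1 on real datasets, so precision is often the more informative companion to sensitivity.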

CONTACT

cjustin@bcgsc.ca, ibirol@bcgsc.ca

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/66ec/5408847/7ec6bd240572/btw811f1.jpg
