Scalable long read self-correction and assembly polishing with multiple sequence alignment.

Affiliations

Univ Rennes, Inria, CNRS, IRISA, 35000, Rennes, France.

Univ. Lille, CNRS, UMR 9189 - CRIStAL, 59000, Lille, France.

Publication information

Sci Rep. 2021 Jan 12;11(1):761. doi: 10.1038/s41598-020-80757-5.

DOI:10.1038/s41598-020-80757-5

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC7804095/

Abstract

Third-generation sequencing technologies can sequence long reads of tens of kbp, which are expected to solve various problems. However, these reads display high error rates, currently capped around 10%. Self-correction is thus regularly used in long-read analysis projects. We introduce CONSENT, a new self-correction method that relies on both multiple sequence alignment and local de Bruijn graphs. To ensure scalability, multiple sequence alignment computation benefits from a new and efficient segmentation strategy, allowing a massive speedup. CONSENT compares well to the state of the art and performs better on real Oxford Nanopore data. Specifically, CONSENT is the only method that efficiently scales to ultra-long reads, and it can process a full human dataset, containing reads of up to 1.5 Mbp, in 10 days. Moreover, our experiments show that error correction with CONSENT improves the quality of Flye assemblies. Additionally, CONSENT implements a polishing feature that corrects raw assemblies. Our experiments show that CONSENT is 2 to 38 times faster than other polishing tools while providing comparable results. Furthermore, we show that, on a human dataset, assembling the raw data and then polishing the assembly consumes fewer resources than correcting and then assembling the reads, while providing better results. CONSENT is available at https://github.com/morispi/CONSENT.
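The abstract names multiple sequence alignment as one of the two building blocks of the correction step but does not describe how a consensus is derived from it. As a purely illustrative sketch, and not CONSENT's actual implementation, the following Python snippet shows the general idea of deriving a consensus from overlapping noisy reads by a column-wise majority vote over an alignment that is assumed to be already computed. The function name `column_consensus`, the gap symbol, and the toy reads are hypothetical.

```python
from collections import Counter


def column_consensus(aligned_rows, gap="-"):
    """Majority-vote consensus over the columns of a gapped alignment.

    Illustrative only: assumes the rows are already aligned and padded
    with `gap` to equal length. This is not CONSENT's algorithm.
    """
    if not aligned_rows:
        return ""
    width = len(aligned_rows[0])
    if any(len(row) != width for row in aligned_rows):
        raise ValueError("all aligned rows must have the same length")

    consensus = []
    for column in zip(*aligned_rows):
        # Pick the most frequent symbol in this column.
        symbol, _count = Counter(column).most_common(1)[0]
        # A column dominated by gaps contributes nothing to the consensus.
        if symbol != gap:
            consensus.append(symbol)
    return "".join(consensus)


# Toy alignment of four noisy copies of the same region (hypothetical data).
rows = [
    "ACGT-ACGTTA",
    "ACGTTACG-TA",  # one spurious insertion, one deletion
    "ACCT-ACGTTA",  # one substitution
    "ACGT-ACGTTA",
]
print(column_consensus(rows))
```

In this toy case each error appears in only one row, so the vote in every column recovers the majority base. Real long-read errors are far denser, which is why a method would need more than a plain vote, for example the local de Bruijn graphs mentioned in the abstract.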

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2cd8/7804095/1798ea77eff1/41598_2020_80757_Fig1_HTML.jpg