Bi-level error correction for PacBio long reads.

Author information

Liu Yuansheng, Lan Chaowang, Blumenstein Michael, Li Jinyan

Publication information

IEEE/ACM Trans Comput Biol Bioinform. 2020 May-June;17(3):899-905. doi: 10.1109/TCBB.2017.2780832. Epub 2017 Dec 7.

Abstract

The latest sequencing technologies, such as the Pacific Biosciences (PacBio) and Oxford Nanopore machines, can generate long reads thousands of nucleotide bases in length, far longer than the reads of a few hundred bases generated by Illumina machines. However, these long reads are prone to much higher error rates, for example 15%, which makes downstream analysis and applications very difficult. Error correction is a process to improve the quality of sequencing data. Hybrid correction strategies have recently been proposed that use Illumina reads, which have low error rates, to fix sequencing errors in the noisy long reads with good performance. In this paper, we propose a new method named Bicolor, a bi-level framework of hybrid error correction for further improving the quality of PacBio long reads. At the first level, our method uses a de Bruijn graph-based error correction idea to search for paths between pairs of solid k-mers iteratively, with an increasing k-mer length. At the second level, we combine the results produced at the first level under different parameters. In particular, a multiple sequence alignment algorithm is used to align those similar long reads, followed by a voting algorithm that determines the final base at each position of the reads. We compare Bicolor with three state-of-the-art methods on three real data sets. Results demonstrate that Bicolor always achieves the highest identity ratio. Bicolor also achieves a higher alignment ratio and a higher number of aligned reads than the current methods on two data sets. On the third data set, our method is closely competitive with the current methods in terms of number of aligned reads and genome coverage. The C++ source code of our algorithm is freely available at https://github.com/yuansliu/Bicolor.
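
The two levels described in the abstract can be illustrated with a toy Python sketch. This is not the Bicolor implementation, which is written in C++ and is considerably more involved. The function names, the `min_count` solidity threshold, the `max_len` bound, and the use of a plain breadth-first search are all assumptions made for illustration, and the voting step assumes the reads have already been aligned into equal-length rows with `-` as the gap symbol.

```python
from collections import Counter, deque


def solid_kmers(short_reads, k, min_count=2):
    """Return k-mers seen at least min_count times in the short reads.

    The min_count threshold is an illustrative assumption.
    """
    counts = Counter(
        read[i:i + k] for read in short_reads for i in range(len(read) - k + 1)
    )
    return {kmer for kmer, n in counts.items() if n >= min_count}


def find_path(solid, source, target, max_len):
    """Breadth-first search in the de Bruijn graph implied by `solid`.

    Nodes are solid k-mers; an edge links two k-mers that overlap by k-1
    bases. Returns the sequence spelled from source to target, or None.
    """
    queue = deque([(source, source)])
    seen = {source}
    while queue:
        kmer, seq = queue.popleft()
        if kmer == target:
            return seq
        if len(seq) >= max_len:
            continue  # do not extend paths beyond the allowed length
        for base in "ACGT":
            nxt = kmer[1:] + base
            if nxt in solid and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, seq + base))
    return None


def majority_vote(aligned):
    """Column-wise majority vote over equal-length gapped sequences.

    Columns whose most common symbol is a gap are dropped.
    """
    consensus = []
    for column in zip(*aligned):
        base, _ = Counter(column).most_common(1)[0]
        if base != "-":
            consensus.append(base)
    return "".join(consensus)
```

In this sketch, a first-level pass would call `solid_kmers` and `find_path` repeatedly with a growing `k`, and a second-level pass would feed the differently parameterized results, once aligned, to `majority_vote`. Ties within a column are broken arbitrarily here; a real voting scheme would need an explicit rule.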

