Liu Yuansheng, Lan Chaowang, Blumenstein Michael, Li Jinyan
IEEE/ACM Trans Comput Biol Bioinform. 2020 May-June;17(3):899-905. doi: 10.1109/TCBB.2017.2780832. Epub 2017 Dec 7.
The latest sequencing technologies, such as the Pacific Biosciences (PacBio) and Oxford Nanopore machines, can generate long reads thousands of bases in length, much longer than the reads of a few hundred bases generated by Illumina machines. However, these long reads are prone to much higher error rates, for example 15%, making downstream analysis and applications very difficult. Error correction is a process to improve the quality of sequencing data. Hybrid correction strategies have recently been proposed that use Illumina reads with low error rates to fix sequencing errors in the noisy long reads, with good performance. In this paper, we propose a new method named Bicolor, a bi-level framework of hybrid error correction for further improving the quality of PacBio long reads. At the first level, our method uses a de Bruijn graph-based error correction idea to search for paths between pairs of solid k-mers iteratively, with an increasing k-mer length. At the second level, we combine the results produced by the first level under different parameters. In particular, a multiple sequence alignment algorithm is used to align those similar long reads, followed by a voting algorithm which determines the final base at each position of the reads. We compare Bicolor with three state-of-the-art methods on three real data sets. Results demonstrate that Bicolor always achieves the highest identity ratio. Bicolor also achieves a higher alignment ratio and a higher number of aligned reads than the current methods on two data sets. On the third data set, our method is closely competitive with the current methods in terms of number of aligned reads and genome coverage. The C++ source code of our algorithm is freely available at https://github.com/yuansliu/Bicolor.
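The two levels described in the abstract can be illustrated with a toy sketch. This is not the Bicolor implementation (which is in C++ at the URL above); it is a minimal Python illustration under stated assumptions: the function names, the `min_count` solidity threshold, the breadth-first path search, and the simple majority rule are all hypothetical choices made for the example. It shows a single value of k; the iterative scheme in the abstract would presumably repeat the first-level search with larger k.

```python
from collections import Counter, defaultdict, deque


def solid_kmers(short_reads, k, min_count=2):
    """Return k-mers seen at least min_count times in the short reads.

    The threshold min_count is an assumption made for this sketch.
    """
    counts = Counter(
        read[i:i + k] for read in short_reads for i in range(len(read) - k + 1)
    )
    return {kmer for kmer, c in counts.items() if c >= min_count}


def build_graph(kmers):
    """de Bruijn graph: edge u -> v when v extends u by one base (k-1 overlap)."""
    graph = defaultdict(list)
    for kmer in kmers:
        for base in "ACGT":
            nxt = kmer[1:] + base
            if nxt in kmers:
                graph[kmer].append(nxt)
    return graph


def find_path(graph, source, target, max_len):
    """Breadth-first search for a shortest sequence spelled from source to target.

    Returns None if no path is found within max_len bases.
    """
    queue = deque([(source, source)])
    seen = {source}
    while queue:
        node, seq = queue.popleft()
        if node == target:
            return seq
        if len(seq) >= max_len:
            continue
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, seq + nxt[-1]))
    return None


def majority_vote(aligned):
    """Column-wise majority vote over equal-length aligned rows ('-' is a gap).

    Columns whose winning symbol is a gap are dropped from the consensus.
    """
    consensus = []
    for column in zip(*aligned):
        base, _ = Counter(column).most_common(1)[0]
        if base != "-":
            consensus.append(base)
    return "".join(consensus)


# First level (toy): bridge two solid k-mers through the short-read graph.
short_reads = ["ATGGCGTGCA", "ATGGCGTGCA", "GGCGTGAA"]
solid = solid_kmers(short_reads, k=4)
graph = build_graph(solid)
bridge = find_path(graph, "ATGG", "TGCA", max_len=20)

# Second level (toy): vote over already-aligned corrected versions of a read.
consensus = majority_vote(["ACG-T", "ACGAT", "AGG-T"])
```

In the toy data, the third short read contributes k-mers that occur only once, so they are not treated as solid and cannot mislead the path search. The voting step assumes the rows have already been aligned to equal length by a multiple sequence alignment tool, which is outside the scope of this sketch.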