Carvalho Antonio Bernardo, Dupim Eduardo G, Goldstein Gabriel
Departamento de Genética, Universidade Federal do Rio de Janeiro, CEP 21941-971, Rio de Janeiro, Brazil.
Genome Res. 2016 Dec;26(12):1710-1720. doi: 10.1101/gr.209247.116. Epub 2016 Oct 7.
Genome assembly depends critically on read length. Two recent technologies, from Pacific Biosciences (PacBio) and Oxford Nanopore, produce read lengths >20 kb, which yield de novo genome assemblies with vastly greater contiguity than those based on Sanger, Illumina, or other technologies. However, the very high error rates of these two new technologies (∼15% per base) make assembly computationally expensive and imprecise at repeats longer than the read length. Here we show that the contiguity and quality of the assembly of these noisy long reads can be significantly improved at minimal cost by leveraging the low error rate and low cost of Illumina short reads. Namely, k-mers from the PacBio raw reads that are not present in Illumina reads (which account for ∼95% of the distinct k-mers) are deemed sequencing errors and ignored at the seed alignment step. By focusing on the ∼5% of k-mers that are error free, read overlap sensitivity is dramatically increased. Of equal importance, the validation procedure can be extended to exclude repetitive k-mers, which prevents read miscorrection at repeats and further improves the resulting assemblies. We tested the k-mer validation procedure using one long-read technology (PacBio) and one assembler (MHAP/Celera Assembler), but it is very likely to yield analogous improvements with alternative long-read technologies and assemblers, such as Oxford Nanopore and BLASR/DALIGNER/Falcon, respectively.
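The k-mer validation idea can be illustrated with a minimal Python sketch. It is not the authors' implementation and does not reproduce MHAP or Celera Assembler. The function names (`build_valid_kmers`, `seed_kmers`), the `min_count` and `max_count` thresholds, and the use of canonical (strand-independent) k-mers are all assumptions made for illustration. The sketch counts k-mers in the short reads, keeps those seen at least `min_count` times and, optionally, no more than `max_count` times (a crude stand-in for excluding repetitive k-mers), and then retains only long-read k-mers found in that validated set as candidate seeds.

```python
from collections import Counter

# Translation table for reverse-complementing uppercase DNA.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    rc = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, rc)


def iter_kmers(seq, k):
    """Yield (position, canonical k-mer) for every k-mer in seq."""
    for i in range(len(seq) - k + 1):
        yield i, canonical(seq[i:i + k])


def build_valid_kmers(short_reads, k, min_count=1, max_count=None):
    """Collect k-mers from low-error short reads (hypothetical helper).

    k-mers seen fewer than min_count times are dropped; if max_count is
    given, k-mers seen more often are dropped too, as a simple proxy for
    repetitive sequence.
    """
    counts = Counter()
    for read in short_reads:
        for _, kmer in iter_kmers(read, k):
            counts[kmer] += 1
    return {
        kmer
        for kmer, n in counts.items()
        if n >= min_count and (max_count is None or n <= max_count)
    }


def seed_kmers(long_read, valid, k):
    """Keep only the long-read k-mers that appear in the validated set."""
    return [(i, kmer) for i, kmer in iter_kmers(long_read, k) if kmer in valid]


if __name__ == "__main__":
    illumina = ["AAACCC"]   # toy short read, assumed error free
    pacbio = "AAATCC"       # toy long read with one substitution
    valid = build_valid_kmers(illumina, k=3)
    print(seed_kmers(pacbio, valid, k=3))
```

In this toy case, every long-read k-mer that overlaps the substituted base is absent from the short-read set and is therefore discarded, leaving only the unaffected k-mer as a seed. Real data would need far larger k, a memory-efficient k-mer counter, and count thresholds tuned to sequencing coverage.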