Zimin Aleksey V, Puiu Daniela, Luo Ming-Cheng, Zhu Tingting, Koren Sergey, Marçais Guillaume, Yorke James A, Dvořák Jan, Salzberg Steven L
Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins School of Medicine, Baltimore, Maryland 21205, USA.
Institute for Physical Sciences and Technology, University of Maryland, College Park, Maryland 20742, USA.
Genome Res. 2017 May;27(5):787-792. doi: 10.1101/gr.213405.116. Epub 2017 Jan 27.
Long sequencing reads generated by single-molecule sequencing technology offer the possibility of dramatically improving the contiguity of genome assemblies. The biggest challenge today is that long reads have relatively high error rates, currently around 15%. The high error rates make it difficult to use these data alone, particularly with highly repetitive plant genomes. Errors in the raw data can lead to insertion or deletion errors (indels) in the consensus genome sequence, which in turn create significant problems for downstream analysis; for example, a single indel may shift the reading frame and incorrectly truncate a protein sequence. Here, we describe an algorithm that solves the high error rate problem by combining long, high-error reads with shorter but much more accurate Illumina sequencing reads, whose error rates average <1%. Our hybrid assembly algorithm combines these two types of reads to construct mega-reads, which are both long and accurate, and then assembles the mega-reads using the CABOG assembler, which was designed for long reads. We apply this technique to a large data set of Illumina and PacBio sequences from a plant species whose large and extremely repetitive genome has resisted previous attempts at assembly. We show that the resulting assembled contigs are far larger than in any previous assembly, with an N50 contig size of 486,807 nucleotides. We compare the contigs to independently produced optical maps to evaluate their large-scale accuracy, and to a set of high-quality bacterial artificial chromosome (BAC)-based assemblies to evaluate base-level accuracy.
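The N50 contig size quoted in the abstract is a standard contiguity statistic. The sketch below shows one common way to compute it, assuming N50 is measured against the total assembled length; the abstract does not say whether total assembly length or an estimated genome size was used as the denominator. The function name `n50` and the toy contig lengths are hypothetical and are not taken from the paper.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths.

    Assumption: N50 is the length L such that contigs of length >= L
    together contain at least half of the total assembled bases.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise ValueError("contig_lengths must not be empty")


# Hypothetical toy assembly, not data from the paper.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))  # total is 300; 80 + 70 reaches 150, so this prints 70
```

A larger N50 means that more of the assembly sits in long contigs, which is why the statistic is used to compare the contiguity of different assemblies of the same genome.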