Roberts Michael, Hunt Brian R, Yorke James A, Bolanos Randall A, Delcher Arthur L
Institute for Physical Science and Technology, University of Maryland, College Park, MD 20742-2431, USA.
J Comput Biol. 2004;11(4):734-52. doi: 10.1089/cmb.2004.11.734.
The whole-genome shotgun (WGS) assembly technique has been remarkably successful in efforts to determine the sequence of bases that make up a genome. WGS assembly begins with a large collection of short fragments that have been selected at random from a genome. The sequence of bases at each end of the fragment is determined, albeit imprecisely, resulting in a sequence of letters called a "read." Each letter in a read is assigned a quality value, which estimates the probability that a sequencing error occurred in determining that letter. Reads are typically cut off after about 500 letters, where sequencing errors become endemic. We report on a set of procedures that (1) corrects most of the sequencing errors, (2) changes quality values accordingly, and (3) produces a list of "overlaps," i.e., pairs of reads that plausibly come from overlapping parts of the genome. Our procedures, which we call collectively the "UMD Overlapper," can be run iteratively and as a preprocessor for other assemblers. We tested the UMD Overlapper on Celera's Drosophila reads. When we replaced Celera's overlap procedures in the front end of their assembler, it was able to produce a significantly improved genome.
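To make the terms "quality value" and "overlap" concrete, here is a minimal Python sketch. It is not the UMD Overlapper's method, which the abstract does not spell out. It rests on two assumptions: that quality values follow the common Phred convention, Q = -10 log10(P), and that an overlap can be illustrated as an exact suffix-prefix match between two reads (real overlappers must tolerate sequencing errors). The function names, the `min_len` parameter, and the toy reads are all hypothetical.

```python
from itertools import permutations


def error_probability(q: int) -> float:
    """Convert a quality value to an error probability.

    Assumes the Phred convention Q = -10 * log10(P); the abstract does
    not state which scale the reads use.
    """
    return 10 ** (-q / 10)


def suffix_prefix_overlap(a: str, b: str, min_len: int = 3) -> int:
    """Return the length of the longest suffix of `a` that exactly
    matches a prefix of `b`, or 0 if none reaches `min_len`.

    Exact matching is a simplification for illustration only.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def list_overlaps(reads: dict[str, str], min_len: int = 3):
    """List (left_read, right_read, overlap_length) for every ordered
    pair of distinct reads whose overlap is at least `min_len`."""
    return [
        (x, y, k)
        for x, y in permutations(reads, 2)
        if (k := suffix_prefix_overlap(reads[x], reads[y], min_len))
    ]


# Toy reads, invented for this example.
reads = {"r1": "ACGTAC", "r2": "TACGGA", "r3": "GGATTT"}
overlaps = list_overlaps(reads)
```

With these toy reads, `overlaps` holds two pairs: `r1` overlaps `r2` on `TAC`, and `r2` overlaps `r3` on `GGA`. The all-pairs comparison is quadratic in the number of reads, so it only suits small illustrations. It says nothing about how a production overlapper scales or handles errors.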