Huang Xiaoqiu, Yang Shiaw-Pyng, Chinwalla Asif T, Hillier LaDeana W, Minx Patrick, Mardis Elaine R, Wilson Richard K
Department of Computer Science, Iowa State University, Ames, IA 50011-1040, USA.
Nucleic Acids Res. 2006 Jan 5;34(1):201-5. doi: 10.1093/nar/gkj419. Print 2006.
We introduce a data structure called a superword array for finding quickly matches between DNA sequences. The superword array possesses some desirable features of the lookup table and suffix array. We describe simple algorithms for constructing and using a superword array to find pairs of sequences that share a unique superword. The algorithms are implemented in a genome assembly program called PCAP.REP for computation of overlaps between reads. Experimental results produced by PCAP.REP and PCAP on a whole-genome dataset show that PCAP.REP produced a more accurate and contiguous assembly than PCAP.
我们引入一种名为超字数组的数据结构,用于快速查找DNA序列之间的匹配项。超字数组具备查找表和后缀数组的一些理想特性。我们描述了构建和使用超字数组以找到共享唯一超字的序列对的简单算法。这些算法在一个名为PCAP.REP的基因组组装程序中实现,用于计算读段之间的重叠。PCAP.REP和PCAP在一个全基因组数据集上产生的实验结果表明,PCAP.REP比PCAP产生了更准确和连续的组装。