Kimura Kouichi, Koike Asako
Central Research Laboratory, Hitachi Ltd, 1-280 Higashi-Koigakubo, Kokubunji, Tokyo 185-8601, Japan.
Genome Inform. 2009 Oct;23(1):60-71.
We introduce a new data structure, a localized suffix array, based on which occurrence information is dynamically represented as the combination of global positional information and local lexicographic order information in text search applications. For the search of a pair of words within a given distance, many candidate positions that share a coarse-grained global position can be compactly represented in term of local lexicographic orders as in the conventional suffix array, and they can be simultaneously examined for violation of the distance constraint at the coarse-grained resolution. Trade-off between the positional and lexicographical information is progressively shifted towards finer positional resolution, and the distance constraint is reexamined accordingly. Thus the paired search can be efficiently performed even if there are a large number of occurrences for each word. The localized suffix array itself is in fact a reordering of bits inside the conventional suffix array, and their memory requirements are essentially the same. We demonstrate an application to genome mapping problems for paired-end short reads generated by new-generation DNA sequencers. When paired reads are highly repetitive, it is time-consuming to naïvely calculate, sort, and compare all of the coordinates. For a human genome re-sequencing data of 36 base pairs, more than 10 times speedups over the naïve method were observed in almost half of the cases where the sums of redundancies (number of individual occurrences) of paired reads were greater than 2,000.
我们引入了一种新的数据结构——局部后缀数组,在文本搜索应用中,基于该结构,出现信息被动态表示为全局位置信息和局部字典序信息的组合。对于在给定距离内搜索一对单词的情况,许多共享粗粒度全局位置的候选位置可以像在传统后缀数组中那样,根据局部字典序进行紧凑表示,并且可以在粗粒度分辨率下同时检查它们是否违反距离约束。位置信息和字典序信息之间的权衡逐渐向更精细的位置分辨率转移,并相应地重新检查距离约束。因此,即使每个单词有大量出现情况,配对搜索也能高效执行。局部后缀数组本身实际上是传统后缀数组内部位的重新排序,它们的内存需求基本相同。我们展示了其在新一代DNA测序仪生成的双端短读段的基因组映射问题中的应用。当双端读段高度重复时,单纯计算、排序和比较所有坐标会很耗时。对于一个36个碱基对的人类基因组重测序数据,在几乎一半的双端读段冗余度(单个出现次数)总和大于2000的情况下,观察到比单纯方法快10倍以上的加速。