Department of Computational Biology, Max Planck Institute for Molecular Genetics, Berlin D-14195, Germany.
Genome Res. 2011 Mar;21(3):487-93. doi: 10.1101/gr.113985.110. Epub 2011 Jan 5.
The main way of analyzing biological sequences is by comparing and aligning them to each other. It remains difficult, however, to compare modern multi-billion-base DNA data sets. The difficulty is caused by the nonuniform (oligo)nucleotide composition of these sequences, rather than their size per se. To solve this problem, we modified the standard seed-and-extend approach (e.g., BLAST) to use adaptive seeds. Adaptive seeds are matches chosen for their rareness rather than for a fixed length. This method guarantees that the number of matches, and thus the running time, increases linearly, instead of quadratically, with sequence length. LAST, our open source implementation of adaptive seeds, enables fast and sensitive comparison of large sequences with arbitrarily nonuniform composition.
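The idea of choosing seeds by rareness can be illustrated with a minimal Python sketch. It is not the LAST implementation: the function name `adaptive_seeds`, the parameter `max_occurrences`, and the brute-force scan of the reference are assumptions made for illustration only, and a practical tool would use an index over the reference instead of scanning it. Starting at each query position, the match is lengthened one base at a time until it occurs at most `max_occurrences` times in the reference.

```python
def adaptive_seeds(query, reference, max_occurrences=1):
    """Toy adaptive-seed finder (illustrative sketch, not LAST itself).

    For every start position in `query`, grow the match until it occurs
    at most `max_occurrences` times in `reference`. Returns a list of
    (query_start, match_length, reference_positions) tuples.
    """
    seeds = []
    for start in range(len(query)):
        # The empty match occurs at every reference position.
        positions = list(range(len(reference)))
        length = 0
        # Lengthen the match while it is still too common.
        while len(positions) > max_occurrences and start + length < len(query):
            base = query[start + length]
            # Keep only the occurrences that also match the next base.
            positions = [
                p for p in positions
                if p + length < len(reference) and reference[p + length] == base
            ]
            length += 1
        # Report the match only if it is non-empty, present, and rare enough.
        if length > 0 and 0 < len(positions) <= max_occurrences:
            seeds.append((start, length, positions))
    return seeds
```

Calling `adaptive_seeds("ACGT", "ACGTACGAAC", max_occurrences=2)` returns one seed per query position, with seeds that are longer where the query begins with a common word and shorter where it begins with a rare one. Because each query position contributes at most `max_occurrences` matches, the total number of seeds in this sketch is bounded by `max_occurrences` times the query length, which mirrors the linear growth described above. A fixed-length seed gives no such bound when some words are highly repeated.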