Clift B, Haussler D, McConnell R, Schneider T D, Stormo G D
Nucleic Acids Res. 1986 Jan 10;14(1):141-58. doi: 10.1093/nar/14.1.141.
We describe a method for representing the structure of repeating sequences in nucleic acids, proteins and other texts. A portion of the sequence is displayed at the bottom of a CRT screen. Above the sequence is its landscape, which looks like a mountain range. Each mountain corresponds to a subsequence of the sequence, and the number of times that subsequence appears is written at the mountain's peak. The landscape is constructed from a data structure called a DAWG (directed acyclic word graph), which can be built in time proportional to the length of the sequence. For the 40 thousand bases of bacteriophage T7, the DAWG can be built in 30 seconds, and any portion of the landscape can be displayed in less than a second. Using sequence landscapes, one can quickly locate significant repeats.
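The abstract gives no implementation details, so the following is only a minimal sketch of the idea it describes, not the authors' program. It assumes the DAWG can be modelled as a suffix automaton built with the standard online construction, and that one column of the landscape can be approximated by the occurrence counts of successively longer substrings starting at a given position. The names `Dawg`, `occurrences` and `column` are hypothetical and chosen for illustration.

```python
class Dawg:
    """Suffix-automaton style DAWG with an occurrence count per state (illustrative)."""

    def __init__(self, text):
        self.text = text
        self.next = [{}]     # outgoing transitions of each state
        self.link = [-1]     # suffix link of each state
        self.length = [0]    # length of the longest substring reaching each state
        self.count = [0]     # number of occurrences (filled in after construction)
        last = 0
        for ch in text:
            cur = self._new_state(self.length[last] + 1, 1)
            p = last
            while p != -1 and ch not in self.next[p]:
                self.next[p][ch] = cur
                p = self.link[p]
            if p == -1:
                self.link[cur] = 0
            else:
                q = self.next[p][ch]
                if self.length[p] + 1 == self.length[q]:
                    self.link[cur] = q
                else:
                    # Split state q: the clone takes over the shorter substrings.
                    clone = self._new_state(self.length[p] + 1, 0)
                    self.next[clone] = dict(self.next[q])
                    self.link[clone] = self.link[q]
                    while p != -1 and self.next[p].get(ch) == q:
                        self.next[p][ch] = clone
                        p = self.link[p]
                    self.link[q] = clone
                    self.link[cur] = clone
            last = cur
        # Push counts from longer states to their suffix links.
        # sorted() makes this step O(n log n); a counting sort by length
        # would keep the whole construction linear in len(text).
        order = sorted(range(1, len(self.length)), key=lambda v: -self.length[v])
        for v in order:
            self.count[self.link[v]] += self.count[v]

    def _new_state(self, length, count):
        self.next.append({})
        self.link.append(-1)
        self.length.append(length)
        self.count.append(count)
        return len(self.length) - 1

    def occurrences(self, sub):
        """Number of times the non-empty string `sub` occurs in the text."""
        state = 0
        for ch in sub:
            state = self.next[state].get(ch)
            if state is None:
                return 0
        return self.count[state]

    def column(self, start):
        """Counts for text[start:start+1], text[start:start+2], ...
        for as long as the substring still occurs at least twice."""
        counts = []
        state = 0
        for ch in self.text[start:]:
            state = self.next[state][ch]
            if self.count[state] < 2:
                break
            counts.append(self.count[state])
        return counts


if __name__ == "__main__":
    dawg = Dawg("ACGTACGA")
    print(dawg.occurrences("ACG"))
    for i in range(len(dawg.text)):
        print(i, dawg.text[i], dawg.column(i))
```

In this sketch, the length of `column(i)` is the length of the longest repeated substring starting at position `i`, and its last entry is how often that substring occurs. Plotting these per position gives a rough analogue of the mountains and peak counts described above.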