Schneider Thomas D, Mastronarde David N
National Cancer Institute, Frederick Cancer Research and Development Center, Laboratory of Mathematical Biology, P. O. Box B, Frederick, MD 21702-1201.
Discrete Appl Math. 1996 Dec 1;71(1-3):259-268. doi: 10.1016/S0166-218X(96)00068-6.
An information-theory-based multiple alignment ("Malign") method was used to align the DNA binding sequences of the OxyR and Fis proteins, whose sequence conservation is so spread out that the sites are difficult to identify. In the algorithm described here, the information content of the sequences is used as the single global criterion for the quality of the alignment. The algorithm uses look-up tables to avoid recalculating computationally expensive functions such as the logarithm. Because there are no arbitrary constants and because the results are reported in absolute units (bits), the best alignment can be chosen without ambiguity. Starting from randomly selected alignments, a hill-climbing algorithm can track through the immense space of n^s combinations, where s is the number of sequences and n is the number of positions possible for each sequence. Because the algorithm is fast, one can afford to run it from many start points and classify the resulting solutions rather than accept a single alignment. Good convergence is indicated by the presence of a single well-populated solution class having higher information content than the other classes. The existence of several distinct classes for the Fis protein indicates that those binding sites have self-similar features.
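The ideas in the abstract can be illustrated with a minimal Python sketch. It is not the authors' Malign implementation. The function names (make_log_table, information, hill_climb, classify), the fixed window width, the move set (re-placing one sequence at a time), and the grouping of solutions by relative offsets are all assumptions made for illustration, and the small-sample correction usually applied to information content is omitted.

```python
import math
import random

BASES = "ACGT"

def make_log_table(s):
    """Look-up table of (k/s) * log2(k/s) for k = 0..s (assumed layout).

    With s sequences, a base count in any column is an integer 0..s,
    so every term of the entropy sum can be read from this table.
    """
    return [0.0] + [(k / s) * math.log2(k / s) for k in range(1, s + 1)]

def information(seqs, offsets, width, table):
    """Information content in bits of the aligned window.

    Each column contributes 2 - H, where H is the entropy of its base
    frequencies. No small-sample correction is applied in this sketch.
    """
    total = 0.0
    for col in range(width):
        counts = dict.fromkeys(BASES, 0)
        for seq, off in zip(seqs, offsets):
            counts[seq[off + col]] += 1
        total += 2.0 + sum(table[c] for c in counts.values())
    return total

def hill_climb(seqs, width, rng):
    """Climb from a random alignment by re-placing one sequence at a time.

    A move is accepted only if it strictly raises the information
    content, so the loop stops at a local maximum.
    """
    n_positions = [len(seq) - width + 1 for seq in seqs]
    table = make_log_table(len(seqs))
    offsets = [rng.randrange(n) for n in n_positions]
    best = information(seqs, offsets, width, table)
    improved = True
    while improved:
        improved = False
        for i, n in enumerate(n_positions):
            for cand in range(n):
                if cand == offsets[i]:
                    continue
                trial = offsets[:i] + [cand] + offsets[i + 1:]
                score = information(seqs, trial, width, table)
                if score > best + 1e-12:
                    offsets, best = trial, score
                    improved = True
    return offsets, best

def classify(seqs, width, starts, seed=0):
    """Run many random starts and group the solutions into classes.

    Solutions are keyed by offsets relative to the first sequence (an
    assumed notion of "class"). Each class maps to a pair:
    (number of starts that reached it, highest bits seen in it).
    """
    rng = random.Random(seed)
    classes = {}
    for _ in range(starts):
        offsets, bits = hill_climb(seqs, width, rng)
        key = tuple(o - offsets[0] for o in offsets)
        count, top = classes.get(key, (0, 0.0))
        classes[key] = (count + 1, max(top, bits))
    return classes
```

Calling classify(seqs, width, starts) on a list of equal-alphabet DNA strings returns a dictionary of solution classes. Under the abstract's criterion, a single well-populated class with the highest bit score would suggest good convergence, while several comparably populated classes would suggest competing alignments.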