Olsen R, Bundschuh R, Hwa T
Department of Physics, University of California at San Diego, La Jolla 92093-0319, USA.
Proc Int Conf Intell Syst Mol Biol. 1999:211-22.
The statistical significance of gapped local alignments is characterized by analyzing the extremal statistics of the scores obtained from the alignment of random amino acid sequences. By identifying a complete set of linked clusters, "islands," we devise a method that accurately predicts the extremal score statistics using only one to a few pairwise alignments. The success of our method relies crucially on the link between the statistics of island scores and extremal score statistics. This link is motivated by heuristic arguments and firmly established by extensive numerical simulations for a variety of scoring parameter settings and sequence lengths. Our approach is several orders of magnitude faster than the widely used shuffling method, since island counting is trivially incorporated into the basic Smith-Waterman alignment algorithm with minimal computational cost, and all islands are counted in a single alignment. The availability of a rapid and accurate significance estimation method gives one the flexibility to fine-tune scoring parameters to detect weakly homologous sequences and obtain optimal alignment fidelity.
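As a rough illustration of how island counting can ride along with a Smith-Waterman pass, the sketch below tags every positive-scoring cell with the island it descends from and records each island's peak score. It is a minimal sketch under stated assumptions, not the authors' implementation: it uses a linear gap penalty rather than whatever gap model the paper studies, breaks ties between predecessors arbitrarily, and the names `island_peak_scores` and `simple_score` are hypothetical.

```python
def simple_score(x, y):
    """Hypothetical toy scoring: +2 for a match, -1 for a mismatch."""
    return 2 if x == y else -1


def island_peak_scores(a, b, score=simple_score, gap=3):
    """Run Smith-Waterman (linear gap penalty, an assumption) on a and b
    and return the peak score of every island found in that single pass.

    An island is taken here to be the set of positive-scoring cells whose
    best-scoring path traces back to the same starting cell.
    """
    m = len(b)
    prev_h = [0] * (m + 1)      # scores of the previous row
    prev_id = [-1] * (m + 1)    # island label per cell, -1 means "no island"
    peaks = []                  # peaks[k] = highest score seen on island k

    for i in range(1, len(a) + 1):
        cur_h = [0] * (m + 1)
        cur_id = [-1] * (m + 1)
        for j in range(1, m + 1):
            candidates = [
                (prev_h[j - 1] + score(a[i - 1], b[j - 1]), prev_id[j - 1]),
                (prev_h[j] - gap, prev_id[j]),        # gap in b
                (cur_h[j - 1] - gap, cur_id[j - 1]),  # gap in a
            ]
            # Ties are broken by list order; this choice is arbitrary.
            h, island = max(candidates, key=lambda c: c[0])
            if h <= 0:
                continue  # cell resets to zero and belongs to no island
            if island == -1:
                # Positive score reached from a zero cell: a new island starts.
                island = len(peaks)
                peaks.append(0)
            cur_h[j] = h
            cur_id[j] = island
            if h > peaks[island]:
                peaks[island] = h
        prev_h, prev_id = cur_h, cur_id

    return peaks
```

The largest entry of the returned list is the ordinary Smith-Waterman optimal score, so the island bookkeeping adds only one label per cell. How the collected peak scores are converted into extremal score statistics is not specified in the abstract and is left out of this sketch.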