Park Yonil, Sheetlin Sergey, Spouge John L
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, Maryland 20894, USA.
Ann Stat. 2009 Dec 1;37(6A):3697. doi: 10.1214/08-AOS663.
The gapped local alignment score of two random sequences follows a Gumbel distribution. If computers could estimate the parameters of the Gumbel distribution within one second, the use of arbitrary alignment scoring schemes could increase the sensitivity of searching biological sequence databases over the web. Accordingly, this article gives a novel equation for the scale parameter of the relevant Gumbel distribution. We speculate that the equation is exact, although present numerical evidence is limited. The equation involves ascending ladder variates in the global alignment of random sequences. In global alignment simulations, the ladder variates yield stopping times specifying random sequence lengths. Because of the random lengths, and because our trial distribution for importance sampling occurs on a different sample space from our target distribution, our study led to a mapping theorem, which led naturally in turn to an efficient dynamic programming algorithm for the importance sampling weights. Numerical studies using several popular alignment scoring schemes then examined the efficiency and accuracy of the resulting simulations.
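To make the role of the Gumbel parameters concrete, the following minimal sketch shows how a scale parameter and a pre-factor would typically be used once they are known. It assumes the standard Gumbel approximation P(S ≥ y) ≈ 1 − exp(−K·m·n·e^(−λy)) for sequences of lengths m and n. The function name `gumbel_pvalue`, the symbols `lam` and `K`, and the numerical values are illustrative assumptions, not quantities taken from the article, and the sketch does not implement the article's importance sampling method.

```python
import math


def gumbel_pvalue(score, m, n, lam, K):
    """Approximate P(S >= score) for two random sequences of lengths m and n.

    Assumes the Gumbel form 1 - exp(-K*m*n*exp(-lam*score)), where lam is
    the scale parameter and K the pre-factor (both hypothetical inputs here).
    """
    # Expected number of chance alignments scoring at least `score`.
    expected_hits = K * m * n * math.exp(-lam * score)
    # 1 - exp(-E), computed with expm1 for accuracy when E is small.
    return -math.expm1(-expected_hits)


# Illustrative parameter values only (assumed, not estimated in this sketch).
p = gumbel_pvalue(score=80, m=300, n=1_000_000, lam=0.267, K=0.041)
print(f"approximate p-value: {p:.3g}")
```

Because the score enters through e^(−λy), a small error in λ changes the p-value of a high score by a large factor, which is why an accurate and fast estimate of the scale parameter matters for arbitrary scoring schemes.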