Yu Y K, Hwa T
Department of Physics, Florida Atlantic University, Boca Raton, FL 33431-0991, USA.
J Comput Biol. 2001;8(3):249-82. doi: 10.1089/10665270152530845.
The score statistics of probabilistic gapped local alignment of random sequences are investigated both analytically and numerically. The full probabilistic algorithm (e.g., the "local" version of the maximum-likelihood or hidden Markov model method) is found to have anomalous statistics. A modified "semi-probabilistic" alignment, a hybrid of Smith-Waterman and probabilistic alignment, is then proposed and studied in detail. The score statistics of the hybrid algorithm are predicted to follow the universal Gumbel form, with the key Gumbel parameter lambda taking a fixed asymptotic value for a wide variety of scoring systems and parameters. A simple recipe is also given for computing the "relative entropy" and, from it, the finite-size correction to lambda. These predictions compare well with direct numerical simulations for sequences of lengths between 100 and 1,000, examined using various PAM substitution scores and affine gap functions. The sensitivity of the hybrid method in detecting sequence homology is also studied using correlated sequences generated from toy mutation models. It is found to be comparable to that of Smith-Waterman alignment and significantly better than that of the Viterbi version of the probabilistic alignment.
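To make the Gumbel form concrete, here is a minimal Python sketch of how such score statistics are typically used. It is not taken from the paper. It assumes the standard extreme-value parameterization in which the tail probability is 1 - exp(-K m n exp(-lambda S)) for sequence lengths m and n; the prefactor K, the function names, and the method-of-moments fit are illustrative assumptions, and lambda is left as an input rather than fixed to any value.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def gumbel_pvalue(score, lam, k, m, n):
    """Tail probability P(S >= score) under an assumed Gumbel law.

    Uses the conventional form 1 - exp(-K*m*n*exp(-lam*score)).
    K and the m*n prefactor are assumptions, not values from the abstract.
    """
    expected_hits = k * m * n * math.exp(-lam * score)
    # -expm1(-x) equals 1 - exp(-x) and stays accurate for tiny x.
    return -math.expm1(-expected_hits)


def sample_gumbel(rng, lam, loc):
    """Draw one value with CDF exp(-exp(-lam*(x - loc))) by inversion."""
    u = rng.random()
    while u <= 0.0:  # avoid log(0); rng.random() lies in [0, 1)
        u = rng.random()
    return loc - math.log(-math.log(u)) / lam


def fit_gumbel_moments(scores):
    """Method-of-moments estimate of (lam, loc) from a sample of scores.

    For a Gumbel law, variance = pi^2 / (6 lam^2) and
    mean = loc + EULER_GAMMA / lam.
    """
    count = len(scores)
    mean = sum(scores) / count
    var = sum((s - mean) ** 2 for s in scores) / (count - 1)
    lam = math.pi / math.sqrt(6.0 * var)
    loc = mean - EULER_GAMMA / lam
    return lam, loc


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic stand-in for the best scores of many random alignments.
    scores = [sample_gumbel(rng, lam=1.0, loc=10.0) for _ in range(20000)]
    lam_hat, loc_hat = fit_gumbel_moments(scores)
    print("estimated lambda:", lam_hat, "estimated location:", loc_hat)
    print("p-value at S=20:", gumbel_pvalue(20.0, lam_hat, 0.1, 300, 300))
```

In a simulation study like the one described, the synthetic sample would be replaced by the best alignment scores of many random sequence pairs, and the fitted lambda compared with the predicted asymptotic value.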