Agarwal P, States D J
Institute for Biomedical Computing, Washington University, St. Louis, MO 63110, USA.
J Comput Biol. 1996 Spring;3(1):1-17. doi: 10.1089/cmb.1996.3.1.
There is an inherent relationship between the process of pairwise sequence alignment and the estimation of evolutionary distance. This relationship is explored and made explicit. Assuming an evolutionary model and given a specific pattern of observed base mismatches, the relative probabilities of evolution at each evolutionary distance are computed using a Bayesian framework. The mean or the median of this probability distribution provides a robust estimate of the central value. The evolutionary distance has traditionally been computed as zero for an observed homology of 20 bases with no mismatches; we prove that it is highly probable that the distance is greater than 0.01. The mean of the distribution is 0.047, which is a better estimate of the evolutionary distance. Bayesian estimates of the evolutionary distance incorporate arbitrary prior information about variable mutation rates both over time and along sequence position, thus requiring only a weak form of the molecular-clock hypothesis. The endpoints of the similarity between genomic DNA sequences are often ambiguous. The probability of evolution at each evolutionary distance can be estimated over the entire set of alignments by choosing the best alignment at each distance and the corresponding probability of duplication at that evolutionary distance. A central value of this distribution provides a robust evolutionary distance estimate. We provide an efficient algorithm for computing the parametric alignment, considering evolutionary distance as the only parameter. These techniques and estimates are used to infer the duplication history of the genomic sequence in C. elegans and in S. cerevisiae. Our results indicate that repeats discovered using a single scoring matrix show a considerable bias in subsequent evolutionary distance estimates.
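The claim about 20 matching bases with no mismatches can be illustrated with a small numerical sketch. The abstract does not state the evolutionary model or the prior, so the code below assumes a Jukes-Cantor substitution model and a flat prior on distance over [0, d_max]; the function names, the grid size, and d_max are likewise hypothetical choices made for illustration. It is not the authors' method and is not expected to reproduce their figures exactly.

```python
import math


def match_prob(d):
    """Probability that two homologous bases are identical at distance d
    (expected substitutions per site) under the ASSUMED Jukes-Cantor model."""
    return 0.25 + 0.75 * math.exp(-4.0 * d / 3.0)


def distance_posterior(n_match, n_mismatch, d_max=5.0, steps=50000):
    """Posterior over distance on a uniform grid, given match/mismatch counts.

    Assumes a flat prior on [0, d_max] (an illustrative choice) and
    normalises with the trapezoid rule.
    """
    h = d_max / steps
    grid = [i * h for i in range(steps + 1)]
    # Likelihood of the observed pattern at each candidate distance.
    like = [
        match_prob(d) ** n_match * (1.0 - match_prob(d)) ** n_mismatch
        for d in grid
    ]
    # Trapezoid weights: half weight at both ends of the grid.
    weights = [h] * (steps + 1)
    weights[0] = weights[-1] = h / 2.0
    mass = [l * w for l, w in zip(like, weights)]
    total = sum(mass)
    post = [m / total for m in mass]  # probability mass per grid point
    return grid, post


def summarize(grid, post, threshold=0.01):
    """Return (mean, median, P(distance > threshold)) of the posterior."""
    mean = sum(d * p for d, p in zip(grid, post))
    cum = 0.0
    median = grid[-1]
    for d, p in zip(grid, post):
        cum += p
        if cum >= 0.5:
            median = d
            break
    tail = sum(p for d, p in zip(grid, post) if d > threshold)
    return mean, median, tail


if __name__ == "__main__":
    grid, post = distance_posterior(n_match=20, n_mismatch=0)
    mean, median, tail = summarize(grid, post)
    print(f"mean={mean:.3f} median={median:.3f} P(d>0.01)={tail:.2f}")
```

Under these assumptions the maximum-likelihood distance for a perfect 20-base match is zero, yet most of the posterior mass lies above 0.01 and the mean is clearly positive, which is the qualitative point the abstract makes. A different substitution model or prior would shift the numbers.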