Elleman T C
J Mol Evol. 1978 Jun 20;11(2):143-61. doi: 10.1007/BF01733890.
A method for detecting homology between two protein or nucleic acid sequences that require insertions or deletions for optimum alignment has been devised for use with a computer. Sequences are assessed for possible relationship by Monte Carlo methods involving comparisons between the alignment of the real sequences and alignments of randomly scrambled sequences of the same composition as the real sequences, each alignment having the optimum number of gaps. As each gap is successively introduced into a comparison (real or random), a maximum score is determined from the similarity of the aligned residues. From the distribution of the maximum alignment scores of randomly scrambled sequences having the same number of gaps, the percentage of random comparisons having higher scores is determined, and the smallest of these percentage levels for each pair of sequences (real or random) indicates the optimum alignment. The fraction of the comparisons of random sequences having percentage levels at their optimum alignment below that of the real sequence comparison at its optimum estimates the probability that such an alignment might have arisen by chance. Related sequences are detected since their optimum alignment score, by virtue of a contribution from ancestral homology in addition to optimised random considerations, occupies a more extreme position in the appropriate frequency distribution of scores than do the majority of optimum scores of randomly scrambled sequences in their appropriate distributions. Application of this 'optimum match' method of sequence comparison shows that the sensitivity of the 'maximum match' method of Needleman and Wunsch (1970) decreases quite dramatically with sequence comparisons which require only a few gaps for a reasonable alignment, or when sequences differ greatly in length.
The 'maximum match' method as applied by Barker and Dayhoff (1972) has the additional disadvantage that deletions which have occurred in the longer of two homologous protein sequences further decrease the sensitivity of detection of relationship. The 'constrained match' method of Sankoff and Cedergren (1973) is seen to be misleading since large increments in the alignment score from added gaps do not necessarily result in a high total alignment score required to demonstrate sequence homology.
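The Monte Carlo procedure described in the abstract can be illustrated with a minimal sketch. This is not the author's program: the function names (`gap_limited_scores`, `optimum_match`), the identity scoring (1 for identical residues, 0 otherwise), the treatment of end overhangs as free, and the definition of a gap as one contiguous run of unaligned residues are all assumptions made for illustration. The abstract does not specify the similarity measure or the treatment of sequence ends.

```python
import random

NEG = float("-inf")

def gap_limited_scores(a, b, max_gaps):
    """Best identity score using at most k internal gaps, for k = 0..max_gaps.

    Assumptions (not from the source): score = count of identical aligned
    residues; a gap is one contiguous run of residues in one sequence left
    unaligned; unaligned overhangs at either end are free.
    """
    n, m = len(a), len(b)
    best, prev_M = [], None  # prev_M holds the match layer for k-1 gaps
    for k in range(max_gaps + 1):
        M = [[NEG] * (m + 1) for _ in range(n + 1)]  # a[i-1] aligned to b[j-1]
        X = [[NEG] * (m + 1) for _ in range(n + 1)]  # a[i-1] inside a gap run
        Y = [[NEG] * (m + 1) for _ in range(n + 1)]  # b[j-1] inside a gap run
        top = 0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                if prev_M is not None:
                    # Open a new gap (spending one gap) or extend the current run.
                    X[i][j] = max(prev_M[i - 1][j], X[i - 1][j])
                    Y[i][j] = max(prev_M[i][j - 1], Y[i][j - 1])
                s = 1 if a[i - 1] == b[j - 1] else 0
                # 0 lets the alignment start here (free leading overhang).
                M[i][j] = s + max(0, M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
                top = max(top, M[i][j])
        best.append(top)
        prev_M = M
    return best

def shuffled(seq, rng):
    """Randomly scramble a sequence, preserving its composition."""
    chars = list(seq)
    rng.shuffle(chars)
    return "".join(chars)

def optimum_match(a, b, max_gaps=3, trials=200, seed=0):
    """Estimate the chance probability of the real alignment at its optimum."""
    rng = random.Random(seed)
    real = gap_limited_scores(a, b, max_gaps)
    rand = [gap_limited_scores(shuffled(a, rng), shuffled(b, rng), max_gaps)
            for _ in range(trials)]
    # Random score distribution for each gap count.
    by_gaps = [[r[k] for r in rand] for k in range(max_gaps + 1)]

    def optimum_level(scores):
        # Percentage of random comparisons scoring higher, per gap count;
        # the smallest percentage marks the optimum alignment.
        levels = [100.0 * sum(x > scores[k] for x in by_gaps[k]) / trials
                  for k in range(max_gaps + 1)]
        k_best = min(range(max_gaps + 1), key=levels.__getitem__)
        return levels[k_best], k_best

    real_level, real_gaps = optimum_level(real)
    rand_levels = [optimum_level(r)[0] for r in rand]
    # Fraction of random comparisons whose optimum level is below the real one.
    p = sum(level < real_level for level in rand_levels) / trials
    return {"scores": real, "real_level": real_level,
            "real_gaps": real_gaps, "p": p}
```

Calling `optimum_match(seq1, seq2)` returns the real scores per gap count, the smallest percentage level, the gap count at which it occurs, and the estimated probability `p`. Choosing the gap count by percentage level rather than by raw score is the point of the approach: raw scores can only increase as gaps are added, so they cannot by themselves indicate when further gaps stop being informative.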