Altschul S F
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894.
J Mol Evol. 1993 Mar;36(3):290-300. doi: 10.1007/BF00160485.
Protein sequence alignments generally are constructed with the aid of a "substitution matrix" that specifies a score for aligning each pair of amino acids. Assuming a simple random protein model, it can be shown that any such matrix, when used for evaluating variable-length local alignments, is implicitly a "log-odds" matrix, with a specific probability distribution for amino acid pairs to which it is uniquely tailored. Given a model of protein evolution from which such distributions may be derived, a substitution matrix adapted to detecting relationships at any chosen evolutionary distance can be constructed. Because in a database search it generally is not known a priori what evolutionary distances will characterize the similarities found, it is necessary to employ an appropriate range of matrices in order not to overlook potential homologies. This paper formalizes this concept by defining a scoring system that is sensitive at all detectable evolutionary distances. The statistical behavior of this scoring system is analyzed, and it is shown that for a typical protein database search, estimating the originally unknown evolutionary distance appropriate to each alignment costs slightly over two bits of information, or somewhat less than a factor of five in statistical significance. A much greater cost may be incurred, however, if only a single substitution matrix, corresponding to the wrong evolutionary distance, is employed.
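The "log-odds" claim can be illustrated with a small numerical sketch. Under the standard local-alignment theory, a scoring matrix s_ij with negative expected score and at least one positive entry implies target pair frequencies q_ij = p_i p_j exp(lambda * s_ij), where lambda is the unique positive solution of sum_ij p_i p_j exp(lambda * s_ij) = 1. The two-letter alphabet, the scores, and the function names below are hypothetical, chosen only so the result is easy to check by hand; they are not taken from the paper.

```python
import math

def solve_lambda(scores, freqs, tol=1e-12):
    """Return the positive lambda with sum_ij p_i p_j exp(lambda * s_ij) = 1.

    scores: dict mapping (a, b) -> score; freqs: dict mapping letter -> background frequency.
    """
    expected = sum(freqs[a] * freqs[b] * s for (a, b), s in scores.items())
    if expected >= 0 or max(scores.values()) <= 0:
        raise ValueError("need a negative expected score and at least one positive score")

    def f(lam):
        return sum(freqs[a] * freqs[b] * math.exp(lam * s)
                   for (a, b), s in scores.items()) - 1.0

    # f(0) = 0, f is negative just above 0 and convex, so bracket the positive root.
    hi = 1.0
    while f(hi) < 0:
        hi *= 2.0
    lo = hi / 2.0
    while f(lo) >= 0:
        hi, lo = lo, lo / 2.0
    # Plain bisection on the bracket [lo, hi].
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def target_frequencies(scores, freqs, lam):
    """Pair distribution q_ij = p_i p_j exp(lambda * s_ij) implied by the matrix."""
    return {(a, b): freqs[a] * freqs[b] * math.exp(lam * s)
            for (a, b), s in scores.items()}

def bits_to_factor(bits):
    """A score cost of `bits` bits corresponds to a factor of 2**bits in significance."""
    return 2.0 ** bits

# Hypothetical toy system: two equally frequent letters, match +1, mismatch -2.
freqs = {"A": 0.5, "B": 0.5}
scores = {("A", "A"): 1, ("B", "B"): 1, ("A", "B"): -2, ("B", "A"): -2}

lam = solve_lambda(scores, freqs)
target = target_frequencies(scores, freqs, lam)
print("lambda:", lam)
print("implied target frequencies:", target)
print("factor for 2 bits:", bits_to_factor(2.0))
```

For this toy matrix, exp(lambda) is the golden ratio, so lambda is about 0.4812 and the implied target distribution puts roughly 81% of its mass on matched pairs. The last helper restates the abstract's unit conversion: a cost of b bits is a factor of 2^b in significance, so a factor of five corresponds to log2(5), about 2.32 bits.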