Yu Yi-Kuo, Wootton John C, Altschul Stephen F
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.
Proc Natl Acad Sci U S A. 2003 Dec 23;100(26):15688-93. doi: 10.1073/pnas.2533904100. Epub 2003 Dec 8.
Amino acid substitution matrices are central to protein-comparison methods. In most commonly used matrices, the substitution scores take a log-odds form, involving the ratio of "target" to "background" frequencies derived from large, carefully curated sets of protein alignments. However, such matrices are often used to compare protein sequences whose amino acid compositions differ markedly from the background frequencies used to construct the matrices. The target frequencies should be adjusted in such cases, but the lack of an appropriate way to do this has been a long-standing problem. This article shows that if one demands consistency between target and background frequencies, then a log-odds substitution matrix implies a unique set of target and background frequencies as well as a unique scale. Standard substitution matrices therefore are truly appropriate only for the comparison of proteins with standard amino acid composition. Accordingly, we present and evaluate a rationale for transforming the target frequencies implicit in a standard matrix to frequencies appropriate for a nonstandard context. This rationale yields asymmetric matrices for the comparison of proteins with divergent compositions. Earlier approaches are unable to deal with this case in a fully consistent manner. Composition-specific substitution matrix adjustment is shown to be useful for comparing compositionally biased proteins, including those of organisms with nucleotide-biased, and therefore codon-biased, genomes or isochores.
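As a rough illustration of the log-odds form and the consistency requirement mentioned in the abstract, the sketch below computes scores as s_ij = (1/lambda) * ln(q_ij / (p_i * p'_j)), where q_ij are target frequencies and p_i, p'_j are the background frequencies obtained as the row and column sums of q. The two-letter alphabet, the numerical values, and the names `log_odds_matrix` and `marginals` are hypothetical and chosen only for illustration. This is not the adjustment procedure of the article, only a toy showing why divergent compositions lead to an asymmetric matrix.

```python
import math

def marginals(q):
    """Return (row sums, column sums) of a joint target-frequency table q."""
    rows = [sum(r) for r in q]
    cols = [sum(q[i][j] for i in range(len(q))) for j in range(len(q[0]))]
    return rows, cols

def log_odds_matrix(q, p_row, p_col, lam=1.0):
    """s[i][j] = (1/lam) * ln(q[i][j] / (p_row[i] * p_col[j]))."""
    return [
        [math.log(q[i][j] / (p_row[i] * p_col[j])) / lam
         for j in range(len(p_col))]
        for i in range(len(p_row))
    ]

# Hypothetical toy target frequencies over a two-letter alphabet.
# Case 1: both sequences share the same composition (symmetric q).
q = [[0.40, 0.10],
     [0.10, 0.40]]
p_row, p_col = marginals(q)          # consistent backgrounds: marginals of q
s = log_odds_matrix(q, p_row, p_col)

# Case 2: the two sequences have different compositions, so the row and
# column marginals differ and the resulting score matrix is asymmetric.
q2 = [[0.45, 0.15],
      [0.05, 0.35]]
p_row2, p_col2 = marginals(q2)       # rows ~ [0.6, 0.4], columns ~ [0.5, 0.5]
s2 = log_odds_matrix(q2, p_row2, p_col2)

for name, mat in (("same composition", s), ("divergent composition", s2)):
    print(name)
    for row in mat:
        print("  " + "  ".join(f"{x:+.3f}" for x in row))
```

In the second case the score for aligning letter 0 with letter 1 differs from the score for aligning letter 1 with letter 0, which is the kind of asymmetry the abstract describes for proteins with divergent compositions.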