Altschul Stephen F, Wootton John C, Gertz E Michael, Agarwala Richa, Morgulis Aleksandr, Schäffer Alejandro A, Yu Yi-Kuo
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.
FEBS J. 2005 Oct;272(20):5101-9. doi: 10.1111/j.1742-4658.2005.04945.x.
Almost all protein database search methods use amino acid substitution matrices for scoring, optimizing, and assessing the statistical significance of sequence alignments. Much care and effort have therefore gone into constructing substitution matrices, and the quality of search results can depend strongly upon the choice of the proper matrix. A long-standing problem has been the comparison of sequences with biased amino acid compositions, for which standard substitution matrices are not optimal. To address this problem, we have recently developed a general procedure for transforming a standard matrix into one appropriate for the comparison of two sequences with arbitrary, and possibly differing, compositions. Such adjusted matrices yield, on average, improved alignments and alignment scores when applied to the comparison of proteins with markedly biased compositions. Here we review the application of compositionally adjusted matrices and consider whether they may also be applied fruitfully to general-purpose protein sequence database searches, in which related sequence pairs do not necessarily have strong compositional biases. Although it is not advisable to apply compositional adjustment indiscriminately, we describe several simple criteria under which invoking such adjustment is on average beneficial. In a typical database search, at least one of these criteria is satisfied by over half the related sequence pairs. Compositional substitution matrix adjustment is now available in NCBI's protein-protein version of BLAST.
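The abstract does not spell out the adjustment procedure, so the following is only a rough sketch of the general idea, not the authors' method. It assumes a toy three-letter alphabet and substitutes a simpler, well-known technique, iterative proportional fitting, to rescale a matrix's implied pair frequencies until their row and column sums match the compositions of the two sequences being compared, and then recomputes log-odds scores. All names and numbers (`ipf_adjust`, `log_odds`, `q`, `P`, `P_prime`) are hypothetical and chosen for illustration.

```python
import math

def ipf_adjust(q, row_targets, col_targets, iters=1000, tol=1e-12):
    """Rescale pair frequencies q so that row sums match row_targets and
    column sums match col_targets (iterative proportional fitting).
    This is a stand-in technique, not the procedure from the paper."""
    n = len(q)
    Q = [row[:] for row in q]
    for _ in range(iters):
        # Scale each row to its target composition.
        for i in range(n):
            f = row_targets[i] / sum(Q[i])
            Q[i] = [x * f for x in Q[i]]
        # Scale each column to its target composition.
        for j in range(n):
            f = col_targets[j] / sum(Q[i][j] for i in range(n))
            for i in range(n):
                Q[i][j] *= f
        # Columns are now exact; stop once rows also agree.
        if max(abs(sum(Q[i]) - row_targets[i]) for i in range(n)) < tol:
            break
    return Q

def log_odds(Q, p_row, p_col, scale=2.0):
    """Log-odds scores s_ij = scale * log2(Q_ij / (p_i * p'_j))."""
    n = len(Q)
    return [[scale * math.log2(Q[i][j] / (p_row[i] * p_col[j]))
             for j in range(n)] for i in range(n)]

# Hypothetical pair frequencies implied by a "standard" matrix (sum to 1).
q = [[0.20, 0.05, 0.05],
     [0.05, 0.20, 0.05],
     [0.05, 0.05, 0.30]]

# Hypothetical, differing compositions of the two sequences being compared.
P = [0.5, 0.3, 0.2]
P_prime = [0.4, 0.3, 0.3]

Q_adj = ipf_adjust(q, P, P_prime)
S_adj = log_odds(Q_adj, P, P_prime)

# Expected score per aligned pair under the background compositions.
expected = sum(P[i] * P_prime[j] * S_adj[i][j]
               for i in range(3) for j in range(3))

for row in S_adj:
    print([round(s, 2) for s in row])
```

The adjusted frequencies keep the original matrix's pattern of favored and disfavored substitutions while becoming consistent with the new compositions, and the expected score under those compositions stays negative, as local alignment statistics require.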