Henikoff J G, Henikoff S
Howard Hughes Medical Institute, Basic Sciences Division, Seattle, WA 98104, USA.
Comput Appl Biosci. 1996 Apr;12(2):135-43. doi: 10.1093/bioinformatics/12.2.135.
Each column of amino acids in a multiple alignment of protein sequences can be represented as a vector of 20 amino acid counts. For alignment and searching applications, the count vector is an imperfect representation of a position, because the observed sequences are an incomplete sample of the full set of related sequences. One general solution to this problem is to model unobserved sequences by adding artificial 'pseudo-counts' to the observed counts. We introduce a simple method for computing pseudo-counts that combines the diversity observed in each alignment position with amino acid substitution probabilities. In extensive empirical tests, this position-based method out-performed other pseudo-count methods and was a substantial improvement over the traditional average score method used for constructing profiles.
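The idea of tying pseudo-counts to both column diversity and substitution probabilities can be sketched in a few lines of Python. This is an illustration only: the abstract does not state the formula, so the rule used here (total pseudo-count equal to a constant `scale` times the number of distinct residues in the column, distributed according to the substitution probabilities of the observed residues) is an assumption. The three-letter alphabet and the `TOY_SUBST` values are hypothetical and are not taken from any real substitution matrix.

```python
from collections import Counter

# Hypothetical conditional substitution probabilities P(a | i) on a toy
# three-letter alphabet; each row sums to 1. Not real amino acid data.
TOY_SUBST = {
    "A": {"A": 0.6, "V": 0.3, "D": 0.1},
    "V": {"A": 0.3, "V": 0.6, "D": 0.1},
    "D": {"A": 0.1, "V": 0.1, "D": 0.8},
}


def position_based_pseudocounts(column, subst_prob, scale=5.0):
    """Return (observed counts, pseudo-counts, total pseudo-count) for one column.

    Assumed rule: the total pseudo-count grows linearly with the number of
    distinct residues observed in the column, and is shared out by averaging
    the substitution probabilities of the observed residues.
    """
    if not column:
        raise ValueError("column must contain at least one residue")
    counts = Counter(column)
    n_total = sum(counts.values())
    total_pseudo = scale * len(counts)  # more diverse column, more pseudo-counts
    pseudo = {
        a: total_pseudo
        * sum((n / n_total) * subst_prob[i][a] for i, n in counts.items())
        for a in subst_prob
    }
    return counts, pseudo, total_pseudo


def column_probabilities(column, subst_prob, scale=5.0):
    """Combine observed counts and pseudo-counts into residue probabilities."""
    counts, pseudo, total_pseudo = position_based_pseudocounts(
        column, subst_prob, scale
    )
    denom = len(column) + total_pseudo
    return {a: (counts.get(a, 0) + pseudo[a]) / denom for a in pseudo}


if __name__ == "__main__":
    for col in ("AAAA", "AVDA"):
        probs = column_probabilities(col, TOY_SUBST)
        print(col, {a: round(p, 3) for a, p in probs.items()})
```

In this sketch a fully conserved column receives few pseudo-counts, and the unobserved residue that substitutes more readily for the observed one receives the larger share, while a diverse column is smoothed more heavily.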