Jones D T, Taylor W R, Thornton J M
Department of Biochemistry and Molecular Biology, University College, London, UK.
Comput Appl Biosci. 1992 Jun;8(3):275-82. doi: 10.1093/bioinformatics/8.3.275.
An efficient means for generating mutation data matrices from large numbers of protein sequences is presented here. By means of an approximate peptide-based sequence comparison algorithm, the set of sequences is clustered at the 85% identity level. The most closely related pairs of sequences are aligned, and the observed amino acid exchanges are tallied in a matrix. The raw mutation frequency matrix is processed in a similar way to that described by Dayhoff et al. (1978), so the resulting matrices can easily be used in current sequence analysis applications in place of the standard mutation data matrices, which have not been updated for 13 years. The method is fast enough to process the entire SWISS-PROT databank in 20 h on a Sun SPARCstation 1, and to generate a matrix from a specific family or class of proteins in minutes. Differences observed between our 250 PAM mutation data matrix and the matrix calculated by Dayhoff et al. are briefly discussed.
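The tallying step can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes the pairs are already aligned (gaps written as `-`), it does not reproduce the approximate peptide-based comparison or the clustering, and it reuses the 85% figure as a simple pairwise identity cutoff, which is an assumption made only for illustration. The function names and the symmetric counting are likewise hypothetical, and the subsequent Dayhoff-style processing of the raw counts is omitted.

```python
from collections import Counter

# The 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def percent_identity(a, b):
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not columns:
        return 0.0
    matches = sum(x == y for x, y in columns)
    return 100.0 * matches / len(columns)


def tally_exchanges(aligned_pairs, min_identity=85.0):
    """Count observed amino acid exchanges in sufficiently similar aligned pairs.

    Assumption: min_identity is applied as a pairwise filter here; the
    abstract uses 85% as a clustering level, which is not the same thing.
    Each exchange is counted in both directions, so the result is symmetric.
    """
    counts = Counter()
    for a, b in aligned_pairs:
        if percent_identity(a, b) < min_identity:
            continue  # skip pairs that are too divergent
        for x, y in zip(a, b):
            # Ignore gaps, non-standard symbols, and identical residues.
            if x in AMINO_ACIDS and y in AMINO_ACIDS and x != y:
                counts[(x, y)] += 1
                counts[(y, x)] += 1
    return counts


if __name__ == "__main__":
    pairs = [
        ("ACDEFGHIKL", "ACDEFGHIKV"),  # 90% identical: one L/V exchange
        ("ACDE", "ACKK"),              # 50% identical: filtered out
    ]
    for (x, y), n in sorted(tally_exchanges(pairs).items()):
        print(x, y, n)
```

Counting each exchange in both directions keeps the raw matrix symmetric, which is one common convention when the direction of change between two closely related sequences is unknown.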