Galzitskaia O V, Dovidchenko N V, Lobanov M Iu, Garbuzinskiĭ S A
Mol Biol (Mosk). 2006 Jan-Feb;40(1):111-21.
We created a database of 452 two-domain proteins with homology below 25%. Using one half of this set, we obtained statistics on the occurrence of amino acid residues at the domain boundaries of multidomain proteins. Small and hydrophilic amino acids (proline, glycine, asparagine, glutamic acid, arginine, and others) occur at domain boundaries more often than in the protein as a whole. Conversely, hydrophobic residues (tryptophan, methionine, phenylalanine, and others) occur at domain boundaries less often. The occurrence scales derived from these statistics were used to predict domain boundaries in the proteins of the second half of the database. The probability scale obtained by averaging residue occurrence over a boundary region of 8 residues (±4 residues from the real domain boundary) gave the best result: for 57% of proteins the predicted boundary was within 40 residues of the boundary assigned from the three-dimensional structure, and for 41% it was within 20 residues. The probability scale was also used to predict domain boundaries for proteins of unknown three-dimensional structure (the international CASP6 competition).
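The abstract does not state the exact formulas, so the following is only a minimal sketch of how a boundary occurrence scale of this kind might be built and applied. The log-ratio form of the scale, the pseudocount, the choice of the highest window average as the predicted boundary, and all function and parameter names (`boundary_scale`, `predict_boundary`, `fraction_within`, `min_domain`) are assumptions made for illustration, not the authors' published method. Only the 8-residue window (±4 residues) and the 40- and 20-residue tolerances come from the text.

```python
from collections import Counter
from math import log

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
HALF_WINDOW = 4  # +/-4 residues around the boundary, 8 residues in total


def boundary_scale(training, half_window=HALF_WINDOW, pseudocount=1.0):
    """Log-ratio of residue frequency near boundaries vs. in whole proteins.

    training: list of (sequence, boundary_index) pairs (hypothetical input
    format). The log-ratio and the pseudocount are assumptions.
    """
    near, overall = Counter(), Counter()
    for seq, b in training:
        overall.update(seq)
        near.update(seq[max(0, b - half_window): b + half_window])
    extra = pseudocount * len(ALPHABET)
    near_total = sum(near[a] for a in ALPHABET) + extra
    all_total = sum(overall[a] for a in ALPHABET) + extra
    return {
        a: log(((near[a] + pseudocount) / near_total)
               / ((overall[a] + pseudocount) / all_total))
        for a in ALPHABET
    }


def predict_boundary(seq, scale, half_window=HALF_WINDOW, min_domain=HALF_WINDOW):
    """Return the position whose surrounding 8-residue window scores highest.

    min_domain (hypothetical) keeps the prediction away from the termini.
    """
    margin = max(min_domain, half_window)
    candidates = range(margin, len(seq) - margin + 1)
    if not candidates:
        raise ValueError("sequence too short for the chosen window")

    def window_mean(i):
        window = seq[i - half_window: i + half_window]
        # Residues outside the 20 standard amino acids contribute 0.
        return sum(scale.get(r, 0.0) for r in window) / len(window)

    return max(candidates, key=window_mean)


def fraction_within(predicted, actual, tolerance):
    """Share of proteins whose predicted boundary is within `tolerance` residues."""
    hits = sum(abs(p - a) <= tolerance for p, a in zip(predicted, actual))
    return hits / len(actual)


# Toy data, invented for illustration only (not from the 452-protein set).
train = [("L" * 20 + "PGNPGNPG" + "W" * 20, 24)]
scale = boundary_scale(train)
test_seq = "W" * 15 + "GPNGPNGP" + "L" * 15
predicted = predict_boundary(test_seq, scale)
```

With a real data set, `boundary_scale` would be run on one half of the proteins and `fraction_within` on the other half, using tolerances of 40 and 20 residues to mirror the evaluation described in the abstract. With very little training data the pseudocount dominates the scores of unseen residues, so the toy example above only illustrates the mechanics.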