Topological maps of protein sequences.

Author information

Ferrán E A, Ferrara P

Affiliation

Sanofi Elf Bio Recherches, Labège Innopole, France.

Publication information

Biol Cybern. 1991;65(6):451-8. doi: 10.1007/BF00204658.

PMID:1958730

Abstract

A new method based on neural networks to cluster proteins into families is described. The network is trained with the Kohonen unsupervised learning algorithm, using matrix pattern representations of the protein sequences as inputs. The components (x, y) of these 20 × 20 matrix patterns are the normalized frequencies of all pairs xy of amino acids in each sequence. We investigate the influence of different learning parameters on the final topological maps obtained with a learning set of ten proteins belonging to three established families. In all cases, except those where the synaptic vectors remain nearly unchanged during learning, the ten proteins are correctly classified into the expected families. The classification by the trained network of mutated or incomplete sequences of the learned proteins is also analysed. The neural network gives a correct classification for a sequence mutated in 21.5% ± 7% of its amino acids and for fragments representing 7.5% ± 3% of the original sequence. Similar results were obtained with a learning set of 32 proteins belonging to 15 families. These results show that a neural network can be trained following the Kohonen algorithm to obtain topological maps of protein sequences, where related proteins are finally associated with the same winner neuron or with neighboring ones, and that the trained network can be applied to rapidly classify new sequences. This approach opens new possibilities to find rapid and efficient algorithms to organize and search for homologies in the whole protein database.
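The pipeline summarized in the abstract (a 20 × 20 pair-frequency pattern per sequence, fed to a Kohonen map) can be illustrated with a minimal sketch. This is not the authors' implementation. The following are assumptions made only for illustration: pairs are taken as adjacent residues, frequencies are normalized to sum to 1, the map is a 3 × 3 grid with a Gaussian neighborhood, and the learning rate and radius decay linearly. The names `dipeptide_pattern`, `winner`, and `train_som`, all parameter values, and the toy sequences are hypothetical.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def dipeptide_pattern(sequence):
    """Flattened 20 x 20 matrix of adjacent-pair frequencies (assumed to sum to 1)."""
    counts = [0.0] * 400
    for a, b in zip(sequence, sequence[1:]):
        if a in INDEX and b in INDEX:  # skip non-standard residue codes
            counts[INDEX[a] * 20 + INDEX[b]] += 1.0
    total = sum(counts)
    return [c / total for c in counts] if total else counts


def winner(weights, pattern):
    """Grid position of the neuron whose synaptic vector is closest to the pattern."""
    def sq_dist(pos):
        return sum((w - p) ** 2 for w, p in zip(weights[pos], pattern))
    return min(weights, key=sq_dist)


def train_som(patterns, rows=3, cols=3, epochs=50, alpha0=0.5, radius0=1.5, seed=0):
    """Train a small Kohonen map; returns {(row, col): synaptic vector}."""
    rng = random.Random(seed)
    dim = len(patterns[0])
    # Small random initial weights on the same scale as the input frequencies.
    weights = {(r, c): [rng.random() / dim for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs          # linear decay (assumed schedule)
        alpha = alpha0 * frac                # learning rate
        radius = max(radius0 * frac, 0.5)    # neighborhood width
        for pattern in patterns:
            wr, wc = winner(weights, pattern)
            for (r, c), vec in weights.items():
                d2 = (r - wr) ** 2 + (c - wc) ** 2
                h = math.exp(-d2 / (2 * radius ** 2))  # Gaussian neighborhood
                for i in range(dim):
                    vec[i] += alpha * h * (pattern[i] - vec[i])
    return weights


# Toy usage with made-up sequences (not real proteins).
toy = ["ACACACACAC", "CACACACACA", "WYWYWYWYWY", "YWYWYWYWYW"]
patterns = [dipeptide_pattern(s) for s in toy]
som = train_som(patterns)
print([winner(som, p) for p in patterns])
```

Once trained, a new sequence would be classified by computing its pattern and looking up its winner neuron, which is a single pass over the map and needs no pairwise alignment against the learning set.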
