Ferrán E A, Ferrara P
Sanofi Elf Bio Recherches, Labège Innopole, France.
Comput Appl Biosci. 1992 Feb;8(1):39-44. doi: 10.1093/bioinformatics/8.1.39.
An artificial neural network was used to cluster proteins into families. The network, composed of 7 x 7 neurons, was trained with the Kohonen unsupervised learning algorithm using, as inputs, matrix patterns derived from the bipeptide composition of 447 proteins, belonging to 13 different families. As a result of the training, and without any a priori indication of the number or composition of the expected families, the network self-organized the activation of its neurons into topologically ordered maps in which almost all the proteins (96.7%) were correctly clustered into the corresponding families. In a second computational experiment, a similar network was trained with one family of the previous learning set (76 cytochrome c sequences). The new neural map clustered these proteins into 25 different neurons (five in the first experiment), wherein phylogenetically related sequences were positioned close to each other. This result shows that the network can adapt the clustering resolution to the complexity of the learning set, a useful feature when working with an unknown number of clusters. Although the learning stage is time consuming, once the topological map is obtained, the classification of new proteins is very fast. Altogether, our results suggest that this novel approach may be a useful tool to organize the search for homologies in large macromolecular databases.
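The procedure summarized above (bipeptide-composition inputs fed to a 7 x 7 Kohonen map, then fast lookup of new proteins) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not specify the input normalization, neighborhood function, learning-rate schedule, or number of training passes, so the Gaussian neighborhood, the linear decay, and every function and parameter name here (`dipeptide_composition`, `train_som`, `best_matching_unit`, `lr0`, `sigma0`) are assumptions chosen for clarity.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def dipeptide_composition(sequence):
    """Return a 400-element vector of adjacent-residue pair frequencies
    (a flattened 20 x 20 matrix). Normalizing by the pair count is an assumption."""
    counts = [0.0] * 400
    total = 0
    for a, b in zip(sequence, sequence[1:]):
        if a in INDEX and b in INDEX:  # skip non-standard symbols
            counts[INDEX[a] * 20 + INDEX[b]] += 1.0
            total += 1
    return [c / total for c in counts] if total else counts


def best_matching_unit(weights, x):
    """Return the (row, col) of the neuron whose weight vector is closest to x."""
    best, best_dist = None, float("inf")
    for pos, w in weights.items():
        dist = sum((wi - xi) ** 2 for wi, xi in zip(w, x))
        if dist < best_dist:
            best, best_dist = pos, dist
    return best


def train_som(data, grid=7, epochs=50, lr0=0.5, sigma0=3.0, seed=0):
    """Train a grid x grid Kohonen map; returns {(row, col): weight_vector}.
    The schedule (linear decay of lr and sigma) is an assumed, common choice."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = {(r, c): [rng.random() / dim for _ in range(dim)]
               for r in range(grid) for c in range(grid)}
    order = list(range(len(data)))
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                   # learning rate shrinks over time
        sigma = max(sigma0 * (1.0 - frac), 0.5)   # neighborhood radius shrinks too
        rng.shuffle(order)
        for i in order:
            x = data[i]
            br, bc = best_matching_unit(weights, x)
            for (r, c), w in weights.items():
                d2 = (r - br) ** 2 + (c - bc) ** 2
                h = lr * math.exp(-d2 / (2.0 * sigma ** 2))  # Gaussian neighborhood
                for k in range(dim):
                    w[k] += h * (x[k] - w[k])     # pull weights toward the input
    return weights
```

Once `train_som` has returned, classifying a new protein costs only one `best_matching_unit` call on its composition vector, which is consistent with the abstract's remark that training is slow but subsequent classification is very fast.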