Self-organizing hierarchic networks for pattern recognition in protein sequence.

Authors

Hanke J, Beckmann G, Bork P, Reich J G

Affiliation

Max-Delbrück-Center for Molecular Medicine, Department of Bioinformatics, Berlin-Buch, Germany.

Publication

Protein Sci. 1996 Jan;5(1):72-82. doi: 10.1002/pro.5560050109.

DOI:10.1002/pro.5560050109

PMID:8771198

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC2143234/

Abstract

We present a method based on hierarchical self-organizing maps (SOMs) for recognizing patterns in protein sequences. The method is fully automatic, does not require prealigned sequences, is insensitive to redundancy in the training set, and works surprisingly well even with small learning sets. Because it uses unsupervised neural networks, it is able to extract patterns that are not present in all of the unaligned sequences of the learning set. The identification of these patterns in sequence databases is sensitive and efficient. The procedure comprises three main training stages. In the first stage, one SOM is trained to extract common features from the set of unaligned learning sequences. A feature is a number of ungapped sequence segments (usually 4-16 residues long) that are similar to segments in most of the sequences of the learning set according to an initial similarity matrix. In the second training stage, the recognition of each individual feature is refined by selecting an optimal weighting matrix out of a variety of existing amino acid similarity matrices. In a third stage of the SOM procedure, the position of the features in the individual sequences is learned. This allows for variants with feature repeats and feature shuffling. The procedure has been successfully applied to a number of notoriously difficult cases with distinct recognition problems: helix-turn-helix motifs in DNA-binding proteins, the CUB domain of developmentally regulated proteins, and the superfamily of ribokinases. A comparison with the established database search procedure PROFILE (and with several others) led to the conclusion that the new automatic method performs satisfactorily.
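The first training stage described in the abstract (one SOM extracting recurring ungapped segments from unaligned sequences) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a fixed window width, a one-dimensional map, a one-hot residue encoding in place of an amino acid similarity matrix, and a linearly decaying learning rate and neighbourhood radius. The function names (`windows`, `encode`, `best_unit`, `train_som`) and the toy sequences are hypothetical and chosen only for illustration.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def windows(seq, width):
    """Return every ungapped segment of `width` residues in `seq`."""
    return [seq[i:i + width] for i in range(len(seq) - width + 1)]


def encode(segment):
    """One-hot encode a segment (assumption: stands in for a similarity matrix).

    Residues outside the 20 standard amino acids contribute all-zero positions.
    """
    vec = []
    for residue in segment:
        vec.extend(1.0 if residue == aa else 0.0 for aa in AMINO_ACIDS)
    return vec


def best_unit(weights, vec):
    """Index of the map unit whose weight vector is closest to `vec`."""
    def sq_dist(w):
        return sum((a - b) ** 2 for a, b in zip(w, vec))
    return min(range(len(weights)), key=lambda k: sq_dist(weights[k]))


def train_som(segments, n_units=4, epochs=30, seed=0):
    """Train a 1-D SOM on encoded segments with a shrinking neighbourhood."""
    rng = random.Random(seed)
    data = [encode(s) for s in segments]
    dim = len(data[0])
    weights = [[rng.random() for _ in range(dim)] for _ in range(n_units)]
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs
        rate = 0.5 * frac                          # learning rate decays linearly
        radius = int(round((n_units / 2) * frac))  # neighbourhood shrinks toward 0
        # Present the segments in a fresh random order each epoch.
        for idx in rng.sample(range(len(data)), len(data)):
            vec = data[idx]
            winner = best_unit(weights, vec)
            lo = max(0, winner - radius)
            hi = min(n_units, winner + radius + 1)
            # Pull the winner and its map neighbours toward the input.
            for k in range(lo, hi):
                w = weights[k]
                for j in range(dim):
                    w[j] += rate * (vec[j] - w[j])
    return weights


# Toy, made-up sequences sharing the segment "TAYIAK" at different offsets.
seqs = ["MKTAYIAKQR", "GGTAYIAKLL", "PWTAYIAKNN"]
segs = [w for s in seqs for w in windows(s, 6)]
som = train_som(segs)
unit_of = {s: best_unit(som, encode(s)) for s in set(segs)}
```

After training, `unit_of` maps each distinct segment to its closest map unit. Segments that recur across sequences are the natural candidates for a shared feature. The later stages in the abstract (choosing a weighting matrix per feature and learning feature positions) are not covered by this sketch.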
