Kupas Katrin, Ultsch Alfred, Klebe Gerhard
Data Bionics Research Group, University of Marburg, Hans-Meerwein-Strasse, D-35032 Marburg, Germany.
Proteins. 2008 May 15;71(3):1288-306. doi: 10.1002/prot.21823.
A new method to discover similar substructures in protein binding pockets, independently of sequence and folding patterns or secondary structure elements, is introduced. The solvent-accessible surface of a binding pocket, automatically detected as a depression on the protein surface, is divided into a set of surface patches. Each surface patch is characterized by its shape as well as by its physicochemical characteristics. Wavelets defined on surfaces are used for the description of the shape, as they have the great advantage of allowing a comparison at different resolutions. The number of coefficients to describe the wavelets can be chosen with respect to the size of the considered data set. The physicochemical characteristics of the patches are described by the assignment of the exposed amino acid residues to one or more of five different properties determinant for molecular recognition. A self-organizing neural network is used to project the high-dimensional feature vectors onto a two-dimensional layer of neurons, called a map. To find similarities between the binding pockets, in both geometrical and physicochemical features, a clustering of the projected feature vectors is performed using an automatic distance- and density-based clustering algorithm. The method was validated with a small training data set of 109 binding cavities originating from a set of enzymes covering 12 different EC numbers. A second test data set of 1378 binding cavities, extracted from enzymes of 13 different EC numbers, was then used to prove the discriminating power of the algorithm and to demonstrate its applicability to large-scale analyses. In all cases, members of the data set with the same EC number were placed into coherent regions on the map, with small distances between them. Different EC numbers were separated by large distances between the feature vectors.
A third data set comprising three subfamilies of endopeptidases is used to demonstrate the ability of the algorithm to detect similar substructures between functionally related active sites. The algorithm can also be used to predict the function of novel proteins not considered in the training data set.
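The projection step described in the abstract, in which a self-organizing network maps high-dimensional feature vectors onto a two-dimensional layer of neurons, can be illustrated with a minimal sketch. This is not the authors' implementation: it is a generic textbook self-organizing map written with NumPy, and the function names (train_som, project), the grid size, the learning-rate and neighbourhood schedules, and the synthetic two-group data are all assumptions made for illustration. The wavelet shape descriptors, the physicochemical property assignment, and the distance- and density-based clustering are not reproduced here.

```python
import numpy as np


def train_som(data, rows=6, cols=6, epochs=20, lr0=0.5, sigma0=3.0, seed=0):
    """Train a small rectangular SOM; returns weights of shape (rows, cols, dim)."""
    rng = np.random.default_rng(seed)
    n, dim = data.shape
    # Initialise each neuron's weight vector from a randomly drawn sample.
    picks = rng.integers(0, n, size=rows * cols)
    weights = data[picks].reshape(rows, cols, dim).astype(float)
    # Grid coordinates of every neuron, used by the neighbourhood function.
    grid = np.stack(
        np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1
    )
    total_steps = epochs * n
    step = 0
    for _ in range(epochs):
        for idx in rng.permutation(n):
            x = data[idx]
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)                # linearly decaying learning rate
            sigma = sigma0 * (1.0 - frac) + 0.5    # shrinking radius, kept above zero
            # Best matching unit: the neuron whose weights are closest to x.
            dists = np.linalg.norm(weights - x, axis=2)
            bmu = np.unravel_index(np.argmin(dists), dists.shape)
            # Gaussian neighbourhood around the BMU, measured on the grid.
            grid_d2 = np.sum((grid - np.array(bmu)) ** 2, axis=2)
            h = np.exp(-grid_d2 / (2.0 * sigma ** 2))
            # Pull the BMU and its grid neighbours towards the sample.
            weights += lr * h[..., None] * (x - weights)
            step += 1
    return weights


def project(data, weights):
    """Return the (row, col) map position of the best matching unit for each vector."""
    flat = weights.reshape(-1, weights.shape[-1])
    d = np.linalg.norm(data[:, None, :] - flat[None, :, :], axis=2)
    best = np.argmin(d, axis=1)
    return np.stack(np.unravel_index(best, weights.shape[:2]), axis=1)


# Synthetic stand-in for patch feature vectors: two groups in 8 dimensions.
# These numbers are made up and carry no biological meaning.
rng = np.random.default_rng(1)
group_a = rng.normal(loc=0.0, scale=0.5, size=(30, 8))
group_b = rng.normal(loc=10.0, scale=0.5, size=(30, 8))
data = np.vstack([group_a, group_b])
labels = np.array([0] * 30 + [1] * 30)

weights = train_som(data)
positions = project(data, weights)
for label in (0, 1):
    print(label, np.unique(positions[labels == label], axis=0).tolist())
```

On a trained map, vectors from the same group are expected to occupy neighbouring neurons, which is the property the abstract reports for binding cavities sharing an EC number. A clustering step would then operate on these map positions or on the distances between neuron weights.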