Zotenko Elena, O'Leary Dianne P, Przytycka Teresa M
Department of Computer Science, University of Maryland, College Park, MD 20742, USA.
BMC Struct Biol. 2006 Jun 8;6:12. doi: 10.1186/1472-6807-6-12.
Recently a new class of methods for fast protein structure comparison has emerged. We call the methods in this class projection methods because they rely on a mapping of protein structure into a high-dimensional vector space. Once the mapping is done, structure comparison reduces to computing the distance between the corresponding vectors. Because structural similarity is approximated by the distance between projections, the success of any projection method depends on how well its mapping function captures the salient features of protein structure. There is no agreement on what constitutes a good projection technique, and the three currently known projection methods take very different approaches to constructing the mapping, both in which structural elements are included and in how this information is integrated to produce a vector representation.
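To illustrate the idea of reducing structure comparison to a distance computation, the following is a minimal sketch rather than any of the published methods. It assumes that each structure has already been mapped to a fixed-length numeric vector and that Euclidean distance is used; the abstract does not say which distance measure any of the methods uses. The names `euclidean`, `rank_by_distance`, and the example vectors are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def rank_by_distance(query, database):
    """Return (name, distance) pairs sorted from closest to farthest.

    `database` maps a structure name to its precomputed projection vector.
    """
    scored = [(name, euclidean(query, vec)) for name, vec in database.items()]
    return sorted(scored, key=lambda pair: pair[1])


# Hypothetical projections of three structures and one query.
database = {
    "structA": [3.0, 0.0, 1.5],
    "structB": [0.0, 2.0, 0.5],
    "structC": [2.5, 0.5, 1.0],
}
ranking = rank_by_distance([3.0, 0.0, 1.0], database)
for name, dist in ranking:
    print(f"{name}\t{dist:.3f}")
```

Because every structure is projected only once, a database search under this scheme costs one vector distance per database entry, which is what makes this class of methods fast.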
In this paper we propose a novel projection method that uses secondary structure information to produce the mapping. First, a diverse set of spatial arrangements of triplets of secondary structure elements, called a set of structural models, is selected automatically. Then each protein structure is mapped into a high-dimensional vector of "counts", or footprint, where each count corresponds to the number of times a given structural model is observed in the structure, weighted by the precision with which the model is reproduced. We perform the first comprehensive evaluation of our method together with all other currently known projection methods.
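The footprint construction described above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not say how a triplet of secondary structure elements is described numerically or how precision is turned into a weight. Here each triplet and each structural model is assumed to be a short numeric feature tuple, and the weight is assumed to be a Gaussian function of the distance between them, with a hypothetical width parameter `sigma`.

```python
import math


def footprint(triplets, models, sigma=1.0):
    """Map a structure to a vector of weighted counts, one per model.

    triplets: feature tuples, one per triplet of secondary structure
              elements found in the structure (hypothetical encoding).
    models:   feature tuples of the selected structural models.
    sigma:    assumed width of the Gaussian weighting.
    """
    counts = [0.0] * len(models)
    for features in triplets:
        for i, model in enumerate(models):
            # Squared distance between the observed triplet and the model.
            d2 = sum((f - m) ** 2 for f, m in zip(features, model))
            # A close match contributes almost 1, a poor match almost 0.
            counts[i] += math.exp(-d2 / (2 * sigma ** 2))
    return counts


# Two hypothetical models and a structure containing three triplets.
models = [(0.0, 0.0), (1.0, 1.0)]
triplets = [(0.0, 0.0), (1.0, 1.0), (0.0, 0.0)]
vector = footprint(triplets, models)
print(vector)
```

With this weighting, an exact reproduction of a model adds 1 to that model's count and an approximate one adds a fraction, so the footprint records both how often and how precisely each model occurs.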
The results of our evaluation suggest that the type of structural information used by a projection method affects its ability to detect structural similarity. In particular, our method, which uses the spatial conformations of triplets of secondary structure elements, outperforms the other methods in most of the tests.