PSIST: indexing protein structures using suffix trees.

Author information

Gao Feng, Zaki Mohammed J

Affiliation

Department of Computer Science, Rensselaer Polytechnic Institute, 110 8th Street, Troy, NY 12180, USA.

Publication information

Proc IEEE Comput Syst Bioinform Conf. 2005:212-22. doi: 10.1109/csb.2005.46.

DOI:10.1109/csb.2005.46

PMID:16447979

Abstract

Approaches for indexing proteins, and for fast and scalable searching for structures similar to a query structure, have important applications such as protein structure and function prediction, protein classification, and drug discovery. In this paper, we developed a new method for extracting the local feature vectors of protein structures. Each residue is represented by a triangle, and the correlation between a set of residues is described by the distances between Cα atoms and the angles between the normals of the planes in which the triangles lie. The normalized local feature vectors are indexed using a suffix tree. For all query segments, suffix trees can be used effectively to retrieve the maximal matches, which are then chained to obtain alignments with database proteins. Similar proteins are selected by their alignment score against the query. Our results show classification accuracy of up to 97.8% and 99.4% at the superfamily and class levels, respectively, according to the SCOP classification, and show that on average 7.49 out of 10 proteins from the same superfamily are retrieved among the top 10 matches. These results are competitive with the best previous methods.
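The local feature described in the abstract (a Cα-Cα distance plus the angle between the normals of two residue triangles) can be illustrated with a minimal Python sketch. Several details here are assumptions rather than the authors' stated method: the abstract does not say which atoms form each triangle, so the backbone N, Cα, and C atoms are assumed; the residue dictionaries, the function names, and the bin widths used for discretization are all hypothetical. The sketch stops at producing discrete symbols and does not build the suffix tree.

```python
import math

def sub(a, b):
    """Component-wise difference of two 3D points."""
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def cross(a, b):
    """Cross product of two 3D vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def norm(a):
    """Euclidean length of a 3D vector."""
    return math.sqrt(sum(x * x for x in a))

def triangle_normal(res):
    """Unit normal of the plane through a residue's triangle.

    Assumption: the triangle is formed by the N, CA and C atoms.
    """
    v = cross(sub(res["N"], res["CA"]), sub(res["C"], res["CA"]))
    length = norm(v)
    return tuple(x / length for x in v)

def pair_feature(res_i, res_j):
    """Return (CA-CA distance, angle in degrees between triangle normals)."""
    dist = norm(sub(res_i["CA"], res_j["CA"]))
    ni, nj = triangle_normal(res_i), triangle_normal(res_j)
    # Clamp the dot product to guard against rounding outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, sum(a * b for a, b in zip(ni, nj))))
    return dist, math.degrees(math.acos(cos_angle))

def discretize(feature, dist_bin=1.0, angle_bin=30.0):
    """Map a feature to a discrete symbol (hypothetical bin widths)."""
    dist, angle = feature
    return (int(dist // dist_bin), int(angle // angle_bin))

# Toy coordinates, not taken from any real protein structure.
res_a = {"N": (1.0, 0.0, 0.0), "CA": (0.0, 0.0, 0.0), "C": (0.0, 1.0, 0.0)}
res_b = {"N": (4.0, 4.0, 0.0), "CA": (3.0, 4.0, 0.0), "C": (3.0, 4.0, 1.0)}

feature = pair_feature(res_a, res_b)
symbol = discretize(feature)
```

Mapping each feature to a discrete symbol turns a protein into a sequence over a finite alphabet, which is the kind of input a suffix tree can index; the binning shown is only one possible normalization.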