Camoğlu Orhan, Kahveci Tamer, Singh Ambuj K
Department of Computer Science, University of California, Santa Barbara, 93106, USA.
Proc IEEE Comput Soc Bioinform Conf. 2003;2:148-58.
We propose two methods for finding similarities in protein structure databases. Both extract feature vectors from triplets of secondary structure elements (SSEs) of proteins and index these vectors using a multidimensional index structure. The first technique addresses the problem of finding the proteins in a dataset that are similar to a given query protein. It uses the index structure to quickly identify promising candidate proteins, which are then aligned to the query protein using a popular pairwise alignment tool such as VAST. We also develop a novel statistical model that estimates the goodness of a match using the SSEs. The second technique addresses the problem of joining two protein datasets to find all-to-all similarities. Experimental results show that our techniques improve the pruning time of VAST by a factor of 3 to 3.5 while keeping sensitivity similar.
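The filter step described in the abstract (triplet feature vectors, an index, then a shortlist of candidates for alignment) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: each SSE is modeled as a 3D line segment, the feature vector of a triplet is the midpoint distance and inter-axis angle of its three SSE pairs, and a uniform grid hash stands in for the multidimensional index structure. The names `pair_features`, `triplet_features`, `cell`, `build_index`, and `candidates` are hypothetical, as are the bin widths.

```python
import math
from collections import defaultdict
from itertools import combinations

# Assumption: an SSE is a line segment ((x, y, z) start, (x, y, z) end)
# with nonzero length. The paper's actual SSE representation may differ.

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _norm(v):
    return math.sqrt(sum(x * x for x in v))

def _mid(sse):
    start, end = sse
    return tuple((x + y) / 2 for x, y in zip(start, end))

def pair_features(a, b):
    """Midpoint distance and angle (radians) between two SSE segments."""
    dist = _norm(_sub(_mid(a), _mid(b)))
    da, db = _sub(a[1], a[0]), _sub(b[1], b[0])
    cos = sum(x * y for x, y in zip(da, db)) / (_norm(da) * _norm(db))
    angle = math.acos(max(-1.0, min(1.0, cos)))  # clamp against rounding
    return dist, angle

def triplet_features(sses):
    """Yield one 6-dimensional vector per SSE triplet, in sequence order."""
    for a, b, c in combinations(sses, 3):
        yield pair_features(a, b) + pair_features(a, c) + pair_features(b, c)

def cell(vec, dist_step=4.0, angle_step=0.5):
    """Quantize a feature vector to a grid cell (hypothetical bin widths)."""
    steps = (dist_step, angle_step) * 3
    return tuple(int(v // s) for v, s in zip(vec, steps))

def build_index(proteins):
    """Map each grid cell to the ids of proteins with a triplet in it."""
    index = defaultdict(set)
    for pid, sses in proteins.items():
        for vec in triplet_features(sses):
            index[cell(vec)].add(pid)
    return index

def candidates(index, query_sses):
    """Rank proteins by how many query triplets share a cell with them."""
    votes = defaultdict(int)
    for vec in triplet_features(query_sses):
        for pid in index.get(cell(vec), ()):
            votes[pid] += 1
    return sorted(votes, key=lambda p: (-votes[p], p))

# Toy data: B is A translated rigidly; C has a very different geometry.
A = [((0, 0, 0), (10, 0, 0)), ((0, 5, 0), (0, 5, 10)), ((3, 0, 7), (3, 9, 7))]
B = [(tuple(x + d for x, d in zip(s, (100, -20, 7))),
      tuple(x + d for x, d in zip(e, (100, -20, 7)))) for s, e in A]
C = [((0, 0, 0), (10, 0, 0)), ((0, 50, 0), (10, 50, 0)), ((0, 100, 0), (10, 100, 0))]
proteins = {"A": A, "B": B, "C": C}

index = build_index(proteins)
shortlist = candidates(index, A)
print(shortlist)
```

In this sketch only the shortlisted proteins would be passed on to a pairwise alignment tool such as VAST, which is where the savings in pruning time would come from. A grid hash misses near matches that fall just across a cell boundary, so a real index would need range or neighbor queries; the sketch also omits the statistical model for scoring matches.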