Camoglu Orhan, Kahveci Tamer, Singh Ambuj K
Department of Computer Science, University of California, Santa Barbara, CA 93106, USA.
Bioinformatics. 2003;19 Suppl 1:i81-3. doi: 10.1093/bioinformatics/btg1009.
We consider the problem of finding similarities in protein structure databases. Current techniques sequentially compare the given query protein to every protein in the database. The cost of a similarity query therefore grows linearly with the size of the database. As databases of experimentally determined and theoretically estimated protein structures grow, scalable search techniques are needed.
Our techniques extract feature vectors from triplets of secondary structure elements (SSEs). These feature vectors are then indexed using a multidimensional index structure. For a given query protein, the index is used to quickly prune unpromising proteins from the database. The remaining proteins are then aligned using a popular alignment tool such as VAST. We also develop a novel statistical model that estimates the goodness of a match using the SSEs. Experimental results show that our techniques speed up the pruning step of VAST by a factor of 3 to 3.5 while maintaining similar sensitivity.
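The extract, index, and prune pipeline described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not say which features are computed or which index structure is used, so everything concrete here is an assumption: each SSE is modelled as a 3-D line segment, the feature vector of a triplet is its three sorted pairwise midpoint distances, a uniform grid hash stands in for the multidimensional index, and the names `triplet_features`, `GridIndex`, `seg`, `eps`, and `min_votes` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations, product
import math


def midpoint(sse):
    """Midpoint of an SSE given as a (start, end) pair of 3-D points."""
    (x1, y1, z1), (x2, y2, z2) = sse
    return ((x1 + x2) / 2, (y1 + y2) / 2, (z1 + z2) / 2)


def triplet_features(sses):
    """Yield one feature vector per SSE triplet.

    Assumed feature: the three pairwise midpoint distances, sorted so the
    vector does not depend on the order of the SSEs or on rigid motion.
    """
    mids = [midpoint(s) for s in sses]
    for a, b, c in combinations(mids, 3):
        yield tuple(sorted((math.dist(a, b), math.dist(a, c), math.dist(b, c))))


class GridIndex:
    """Uniform grid hash, a simple stand-in for a multidimensional index."""

    def __init__(self, cell=2.0):
        self.cell = cell
        self.buckets = defaultdict(list)  # grid cell -> [(vector, protein id)]

    def _key(self, vec):
        return tuple(math.floor(v / self.cell) for v in vec)

    def add(self, protein_id, sses):
        for vec in triplet_features(sses):
            self.buckets[self._key(vec)].append((vec, protein_id))

    def candidates(self, query_sses, eps=1.0, min_votes=1):
        """Return ids of proteins that survive pruning.

        A protein gets one vote per query triplet that has a database
        triplet within eps in every feature dimension.
        """
        if eps > self.cell:
            raise ValueError("eps must not exceed the cell size")
        votes = defaultdict(int)
        for q in triplet_features(query_sses):
            key = self._key(q)
            matched = set()
            # Only the 27 neighbouring cells can hold vectors within eps.
            for offset in product((-1, 0, 1), repeat=3):
                cell_key = tuple(k + o for k, o in zip(key, offset))
                for vec, pid in self.buckets.get(cell_key, ()):
                    if max(abs(a - b) for a, b in zip(q, vec)) <= eps:
                        matched.add(pid)
            for pid in matched:
                votes[pid] += 1
        return {pid for pid, n in votes.items() if n >= min_votes}


def seg(mid):
    """Hypothetical helper: a short SSE segment centred at `mid`."""
    x, y, z = mid
    return ((x - 1, y, z), (x + 1, y, z))


index = GridIndex(cell=2.0)
index.add("similar", [seg((5, 5, 5)), seg((15, 5, 5)), seg((5, 15, 5))])
index.add("different", [seg((0, 0, 0)), seg((30, 0, 0)), seg((0, 40, 0))])

query = [seg((0, 0, 0)), seg((10.5, 0, 0)), seg((0, 10, 0))]
survivors = index.candidates(query, eps=1.0)
```

In this toy setup only the proteins in `survivors` would be passed on to the expensive alignment step (VAST in the abstract); the rest are discarded without being aligned.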