Department of Computer Science, University of Haifa, Mount Carmel, Haifa 31905, Israel.
Proc Natl Acad Sci U S A. 2010 Feb 23;107(8):3481-6. doi: 10.1073/pnas.0914097107. Epub 2010 Feb 3.
Fast identification of protein structures that are similar to a specified query structure in the entire Protein Data Bank (PDB) is fundamental in structure and function prediction. We present FragBag, an ultrafast and accurate method for comparing protein structures. We describe a protein structure by the collection of its overlapping short contiguous backbone segments, and discretize this set using a library of fragments. Then, we succinctly represent the protein as a "bag-of-fragments" (a vector that counts the number of occurrences of each library fragment) and measure the similarity between two structures by the similarity between their vectors. Our representation has two additional benefits: (i) it can be used to construct an inverted index for implementing a fast structural search engine over the entire PDB, and (ii) one can specify a structure as a collection of substructures, without combining them into a single structure; this is valuable for structure prediction, when reliable predictions exist only for parts of the protein. We use receiver operating characteristic curve analysis to quantify the success of FragBag in identifying neighbor candidate sets in a dataset of over 2,900 structures. The gold standard is the set of neighbors found by six state-of-the-art structural aligners. Our best FragBag library finds more accurate candidate sets than the three other filter methods: the SGM, PRIDE, and a method by Zotenko et al. More interestingly, FragBag performs on a par with the computationally expensive, yet highly trusted structural aligners STRUCTAL and CE.
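The bag-of-fragments representation and the inverted index described in the abstract can be illustrated with a minimal Python sketch. It makes several assumptions that are not stated in the abstract: the discretization step is taken as already done, so each protein arrives as a list of library fragment indices (one per overlapping backbone segment); cosine similarity stands in as one possible vector similarity; and the function names `bag_of_fragments`, `cosine_similarity`, `build_inverted_index`, and `candidates`, as well as the toy library of four fragments, are hypothetical.

```python
from collections import Counter, defaultdict
from math import sqrt


def bag_of_fragments(fragment_ids, library_size):
    """Count how often each library fragment occurs in one structure.

    fragment_ids: library index assigned to each overlapping backbone
    segment (assumed to be precomputed by some discretization step).
    """
    counts = Counter(fragment_ids)
    return [counts.get(i, 0) for i in range(library_size)]


def cosine_similarity(u, v):
    """Cosine similarity of two count vectors (one possible choice)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def build_inverted_index(bags):
    """Map each fragment index to the set of structures containing it."""
    index = defaultdict(set)
    for name, bag in bags.items():
        for fragment, count in enumerate(bag):
            if count:
                index[fragment].add(name)
    return index


def candidates(index, query_bag):
    """Structures sharing at least one fragment with the query."""
    found = set()
    for fragment, count in enumerate(query_bag):
        if count:
            found |= index.get(fragment, set())
    return found


# Toy data: a hypothetical library of 4 fragments and 3 structures.
LIBRARY_SIZE = 4
bags = {
    "A": bag_of_fragments([0, 1, 1, 3], LIBRARY_SIZE),
    "B": bag_of_fragments([0, 1, 1, 3, 3], LIBRARY_SIZE),
    "C": bag_of_fragments([2, 2, 2], LIBRARY_SIZE),
}
index = build_inverted_index(bags)

query = bag_of_fragments([1, 1, 3], LIBRARY_SIZE)
ranked = sorted(
    candidates(index, query),
    key=lambda name: cosine_similarity(bags[name], query),
    reverse=True,
)
```

In this sketch the inverted index is used only to narrow the database to structures that share at least one fragment with the query; the surviving candidates are then ranked by vector similarity. Under the same assumptions, a query given as several separate substructures could be handled by summing their count vectors, since the bag does not depend on how the segments are assembled into one structure.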