Xie Lei, Xie Li, Bourne Philip E
San Diego Supercomputer Center, University of California, San Diego, La Jolla, CA 92093, USA.
Bioinformatics. 2009 Jun 15;25(12):i305-12. doi: 10.1093/bioinformatics/btp220.
Functional relationships between proteins that do not share global structural similarity can be established by detecting similarity between their ligand-binding sites. For large-scale comparisons, it is critical to assess the statistical significance of this similarity accurately and efficiently. Here, we report an efficient statistical model that supports local, sequence order-independent ligand-binding-site similarity searching. Most existing statistical models consider only the matching vertices between two sites that are each defined by a fixed number of points. In reality, the boundary of a binding site is often unknown or depends on the bound ligand, which limits these approaches. To address these shortcomings and to perform binding-site mapping on a genome-wide scale, we developed a sequence order-independent profile-profile alignment (SOIPPA) algorithm that can detect local similarity between binding sites that are not known a priori. SOIPPA scoring integrates geometric, evolutionary and physical information into a unified framework. This, however, makes it considerably harder to assess the statistical significance of the similarity, because the conventional probability model based on fixed-point matching cannot be applied. We find that SOIPPA binding-site matching scores follow an extreme value distribution (EVD). Benchmark studies show that the EVD model is at least two orders of magnitude faster, and more accurate, than the non-parametric statistical method used in the previous version of SOIPPA. Efficient statistical analysis makes it possible to apply SOIPPA to genome-based drug discovery. Accordingly, we applied the approach to the structural genome of Mycobacterium tuberculosis to construct a protein-ligand interaction network. The network reveals highly connected proteins, which represent suitable targets for promiscuous drugs.
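To illustrate how an EVD can turn a raw similarity score into a significance estimate, the following minimal sketch fits a type I (Gumbel) extreme value distribution to a set of background scores and computes the probability of observing a score at least as high by chance. The abstract does not give the parameterization or fitting procedure used for SOIPPA, so the Gumbel form, the method-of-moments fit, and all function names here (`fit_gumbel_moments`, `evd_p_value`, `sample_gumbel`) are assumptions made for illustration only.

```python
import math
import random
import statistics

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def fit_gumbel_moments(scores):
    """Estimate Gumbel location (mu) and scale (beta) by the method of moments.

    Assumption: background scores follow a type I EVD; this is not
    necessarily the fitting procedure used by SOIPPA.
    """
    mean = statistics.fmean(scores)
    std = statistics.stdev(scores)
    beta = std * math.sqrt(6.0) / math.pi
    mu = mean - EULER_GAMMA * beta
    return mu, beta


def evd_p_value(score, mu, beta):
    """Return P(S >= score) = 1 - exp(-exp(-(score - mu) / beta))."""
    z = (score - mu) / beta
    if z < -700.0:  # exp(-z) would overflow; the p-value is effectively 1
        return 1.0
    # expm1 keeps precision when the p-value is very small
    return -math.expm1(-math.exp(-z))


def sample_gumbel(mu, beta, n, seed=0):
    """Draw n synthetic Gumbel scores by inverse-transform sampling."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        u = max(rng.random(), 1e-300)  # avoid log(0)
        samples.append(mu - beta * math.log(-math.log(u)))
    return samples


if __name__ == "__main__":
    # Synthetic background scores stand in for scores of unrelated site pairs.
    background = sample_gumbel(mu=10.0, beta=2.0, n=20000)
    mu_hat, beta_hat = fit_gumbel_moments(background)
    print(f"fitted mu={mu_hat:.3f}, beta={beta_hat:.3f}")
    print(f"p-value for score 25: {evd_p_value(25.0, mu_hat, beta_hat):.3e}")
```

Once the two parameters are fitted, each p-value costs only a closed-form evaluation, which is the kind of saving a parametric model offers over a non-parametric estimate that must be recomputed from background comparisons.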