School of Computer Science, University of Nottingham, Nottingham NG8 1BB, U.K.
IEEE Trans Nanobioscience. 2010 Jun;9(2):144-55. doi: 10.1109/TNB.2010.2043851.
Protein-structure comparison (PSC) is an essential component of biomedical research: it affects, e.g., drug design, molecular docking, protein folding, and structure prediction algorithms, and it is essential to the assessment of these predictions. Each of these applications, as well as many others where molecular comparison plays an important role, requires a different notion of similarity, which naturally leads to the multicriteria PSC (MC-PSC) problem. Protein (Structure) Comparison, Knowledge, Similarity, and Information (ProCKSI) (www.procksi.org) provides algorithmic solutions for the MC-PSC problem by means of an enhanced structural comparison that relies on the principled application of information fusion to similarity assessments derived from multiple comparison methods. The current MC-PSC implementation works well for moderately sized datasets, but it is time consuming because it provides a public service to multiple users. Many of the structural bioinformatics applications mentioned above would benefit from the ability to perform, for a dedicated user, thousands or tens of thousands of comparisons through multiple methods in real time, a capacity beyond our current technology. In this paper, we take a key step in that direction by means of a high-throughput distributed reimplementation of ProCKSI for very large datasets. The core of the proposed framework is an innovative distributed algorithm that runs on each compute node in a cluster/grid environment and performs structure comparison on a given subset of the input structures using some of the most popular PSC methods [e.g., universal similarity metric (USM), maximum contact map overlap (MaxCMO), fast alignment and search tool (FAST), distance alignment (DaliLite), combinatorial extension (CE), template modeling alignment (TMAlign)]. We follow this with a procedure of distributed consensus building. Thus, the new algorithms proposed here match the quality of ProCKSI's similarity assessment in a fraction of the time it requires. Our results show that the proposed distributed method can be used efficiently to compare: 1) a particular protein against a very large protein-structure dataset (target-against-all comparison), and 2) a particular very large-scale dataset against itself or against another very large-scale dataset (all-against-all comparison). We conclude the paper by enumerating some of the outstanding challenges for real-time MC-PSC.
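The general shape of such a workflow can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the function names (`make_pairs`, `partition`, `consensus`), the round-robin assignment of pairs to nodes, and the consensus rule (averaging min-max-normalized scores) are all assumptions chosen for illustration. It also assumes that every method's scores have already been oriented the same way (higher means more similar), which is not true of the raw outputs of the methods listed above.

```python
from itertools import combinations
from statistics import mean


def make_pairs(structures, target=None):
    """Build the comparison jobs (hypothetical helper).

    With no target: all-against-all pairs.
    With a target: target-against-all pairs.
    """
    if target is None:
        return list(combinations(structures, 2))
    return [(target, s) for s in structures if s != target]


def partition(pairs, n_nodes):
    """Assign pairs to compute nodes round-robin (an assumed strategy)."""
    return [pairs[i::n_nodes] for i in range(n_nodes)]


def consensus(scores_by_method):
    """Fuse per-method scores into one score per pair.

    Each method's scores are min-max normalized to [0, 1] and then
    averaged across methods. This is an illustrative fusion rule only.
    """
    normalized = []
    for scores in scores_by_method.values():
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0  # avoid division by zero for constant scores
        normalized.append({p: (v - lo) / span for p, v in scores.items()})
    pairs = normalized[0].keys()
    return {p: mean(n[p] for n in normalized) for p in pairs}
```

In this sketch, each node would run every selected PSC method on its own share of the pairs, and `consensus` would be applied once the per-method score tables have been gathered. The method names used as dictionary keys are free-form labels, not calls into any real PSC tool.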