Kozakov Dima, Clodfelter Karl H, Vajda Sandor, Camacho Carlos J
Department of Biomedical Engineering, Boston University, Massachusetts, USA.
Biophys J. 2005 Aug;89(2):867-75. doi: 10.1529/biophysj.104.058768. Epub 2005 May 20.
Clustering is one of the most powerful tools in computational biology. The conventional wisdom is that events that occur in clusters are probably not random. In protein docking, the underlying principle is that clustering occurs because long-range electrostatic and/or desolvation forces steer the proteins to a low free-energy attractor at the binding region. Something similar occurs in the docking of small molecules, although in this case shorter-range van der Waals forces play a more critical role. Building on these observations, we have developed two clustering strategies that predict docked conformations from the clustering properties of a uniform sampling of low free-energy protein-protein and protein-small molecule complexes. We report significant improvements in the automated prediction and discrimination of docked conformations when cluster size and consensus are used as ranking criteria. We show that the success of clustering depends on identifying the appropriate clustering radius for the system. The clustering radius for protein-protein complexes is consistent with the range of the electrostatic and desolvation free energies (i.e., between 4 and 9 Angstroms); for protein-small molecule docking, the radius is set by van der Waals interactions (i.e., approximately 2 Angstroms). Without any a priori information, a simple analysis of the histogram of pairwise distances between the docked conformations can evaluate the clustering properties of the data set. Clustering is observed when the histogram is bimodal. Data clustering is optimal when the clustering radius is chosen at the minimum that follows the first peak of the bimodal distribution. We show that using this optimal radius further improves the discrimination of near-native complex structures.
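The histogram-based choice of clustering radius described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the poses are synthetic 3-D points, Euclidean distance stands in for the pairwise RMSD between docked conformations, and the function names (`optimal_radius`, `greedy_clusters`, `toy_poses`), the bin width of 1, the 3-bin smoothing, and the greedy neighbour-count clustering are all hypothetical choices made for the example.

```python
import math
import random
from itertools import combinations


def distance(a, b):
    # Euclidean distance; a stand-in (assumption) for RMSD between two poses.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def histogram(values, bin_width):
    # Count values in bins [0, w), [w, 2w), ...
    counts = [0] * (int(max(values) / bin_width) + 1)
    for v in values:
        counts[int(v / bin_width)] += 1
    return counts


def smooth(counts):
    # 3-bin moving average (window truncated at the ends) to damp bin noise.
    out = []
    for i in range(len(counts)):
        window = counts[max(0, i - 1):i + 2]
        out.append(sum(window) / len(window))
    return out


def optimal_radius(dists, bin_width=1.0):
    # Centre of the lowest bin between the first and second peak of the
    # distance histogram, or None if the histogram never rises again
    # after its first peak (i.e. it is not bimodal).
    s = smooth(histogram(dists, bin_width))
    n = len(s)
    i = 0
    while i + 1 < n and s[i + 1] >= s[i]:   # climb to the first peak
        i += 1
    first_peak = i
    while i + 1 < n and s[i + 1] <= s[i]:   # descend until counts rise again
        i += 1
    if i == n - 1:
        return None
    second_peak = max(range(i + 1, n), key=lambda k: s[k])
    valley = min(range(first_peak, second_peak + 1), key=lambda k: s[k])
    return (valley + 0.5) * bin_width


def greedy_clusters(points, radius):
    # Assumed scheme: take the pose with the most neighbours within `radius`
    # as a cluster centre, remove its members, repeat. Clusters are returned
    # largest first, so cluster size serves as the ranking criterion.
    remaining = set(range(len(points)))
    clusters = []
    while remaining:
        members_of = {
            c: [j for j in remaining
                if distance(points[c], points[j]) <= radius]
            for c in remaining
        }
        centre = max(sorted(remaining), key=lambda c: len(members_of[c]))
        clusters.append((centre, members_of[centre]))
        remaining -= set(members_of[centre])
    return clusters


def toy_poses(seed=0):
    # Synthetic data: two compact groups of poses, 30 and 20 points,
    # each drawn uniformly from a cube of side 3, with the cubes 30 apart.
    rng = random.Random(seed)

    def cube(x0, n):
        return [(x0 + rng.uniform(0, 3), rng.uniform(0, 3), rng.uniform(0, 3))
                for _ in range(n)]

    return cube(0.0, 30) + cube(30.0, 20)


poses = toy_poses()
dists = [distance(a, b) for a, b in combinations(poses, 2)]
radius = optimal_radius(dists)
clusters = greedy_clusters(poses, radius)
sizes = [len(members) for _, members in clusters]
print("clustering radius:", radius)
print("cluster sizes, largest first:", sizes)
```

On this toy set the chosen radius falls in the empty gap between the within-group and between-group distances, so the two groups are recovered as clusters of 30 and 20 poses. When the histogram has a single peak, `optimal_radius` returns `None`, which mirrors the statement that clustering is observed only when the histogram is bimodal.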