A comparative study on the application of hierarchical-agglomerative clustering approaches to organize outputs of reiterated docking runs.

Author information

Bottegoni Giovanni, Cavalli Andrea, Recanatini Maurizio

Affiliation

Department of Pharmaceutical Sciences, University of Bologna, Via Belmeloro, 6-I-40126 Bologna, Italy.

Publication information

J Chem Inf Model. 2006 Mar-Apr;46(2):852-62. doi: 10.1021/ci050141q.

DOI:10.1021/ci050141q

PMID:16563017

Abstract

Reiterated runs of standard docking protocols usually yield a collection of possible binding modes rather than a single solution. This ensemble is then typically ranked by an energy-based scoring function. However, because computing the binding free energy requires many approximations, scoring functions cannot always rank the experimental pose among the top scorers. Cluster analysis might help to overcome this limitation, provided that the clusterability of the data has been assessed beforehand. In this paper, we first present a modified version of a test originally developed by Hopkins to assess whether docking outputs show a natural tendency to group into clusters. We then report the results of a comparative study on the application of different hierarchical-agglomerative clustering rules to partition docking outputs. The rule that best handled the observed data was finally applied to the whole ensemble of poses collected from several docking tools. The combination of the average linkage rule with the cutting function developed by Sutcliffe and co-workers turned out to meet all of the criteria required for a robust clustering protocol. Furthermore, consensus clustering allowed us to identify the pose closest to the experimental one within a statistically significant cluster; the number of such clusters was always small.
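
As an illustration of the average linkage rule named in the abstract, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes the poses are already summarized as a symmetric matrix of pairwise distances (for example, heavy-atom RMSD in angstroms). The function name `average_linkage`, the toy matrix, and the fixed distance `cutoff` are all hypothetical; in particular, the fixed cutoff is a simple stand-in for the cutting function of Sutcliffe and co-workers, which is not reproduced here.

```python
from itertools import combinations


def average_linkage(dist, cutoff):
    """Agglomerative clustering with the average linkage rule.

    dist   -- symmetric matrix (list of lists) of pairwise pose distances;
              assumed here to be RMSD values in angstroms
    cutoff -- merging stops once the two closest clusters are farther apart
              than this value (a fixed threshold chosen for illustration,
              NOT the Sutcliffe cutting function mentioned in the abstract)
    """
    # Start with every pose in its own cluster.
    clusters = [[i] for i in range(len(dist))]

    def linkage(a, b):
        # Average linkage: mean of all pairwise distances between clusters.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        # Find the closest pair of clusters under the linkage rule.
        (x, y), d = min(
            (((x, y), linkage(clusters[x], clusters[y]))
             for x, y in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if d > cutoff:
            break
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)

    return sorted(sorted(c) for c in clusters)


# Toy data (invented for illustration): poses 0-2 resemble each other,
# as do poses 3-4, and the two groups are far apart.
toy_rmsd = [
    [0.0, 0.5, 0.7, 6.0, 6.2],
    [0.5, 0.0, 0.6, 5.9, 6.1],
    [0.7, 0.6, 0.0, 6.3, 6.0],
    [6.0, 5.9, 6.3, 0.0, 0.8],
    [6.2, 6.1, 6.0, 0.8, 0.0],
]
print(average_linkage(toy_rmsd, cutoff=2.0))  # prints [[0, 1, 2], [3, 4]]
```

With a cutoff of 2.0, the two tight groups are recovered and then left unmerged, because their average inter-group distance (about 6.1) exceeds the threshold. For large ensembles of poses, a library routine such as SciPy's `scipy.cluster.hierarchy.linkage` with `method="average"` would likely be preferable to this quadratic-per-step loop.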
