Department of Medicinal Chemistry, University of Michigan, Ann Arbor, Michigan 48109-1065, United States.
J Chem Inf Model. 2011 Sep 26;51(9):2036-46. doi: 10.1021/ci200082t. Epub 2011 Jul 22.
A major goal in drug design is the improvement of computational methods for docking and scoring. The Community Structure Activity Resource (CSAR) aims to collect available data from industry and academia that can be used for this purpose (www.csardock.org). CSAR is also charged with organizing community-wide exercises based on the collected data. The first of these exercises aimed to gauge the overall state of docking and scoring, using a large and diverse data set of protein-ligand complexes. Participants were asked to calculate the affinity of the complexes as provided and then to recalculate with any changes that might improve their specific method. This first data set was selected from existing PDB entries that had binding data (K_d or K_i) in Binding MOAD, augmented with entries from PDBbind. The final data set contains 343 diverse protein-ligand complexes and spans 14 pK_d units. Sixteen proteins have three or more complexes in the data set, from which a user could start an inspection of congeneric series. Inherent experimental error limits the possible correlation between scores and measured affinity: Pearson R is limited to ~0.91 (R² ~0.83) when fitting to the data set without overparameterizing, and to ~0.83 (R² ~0.70) when scoring the data set with a method trained on outside data [corrected]. The details of how the data set was initially selected, and the process by which it matured to better fit the needs of the community, are presented. Many groups generously participated in improving the data set, which underscores the value of a supportive, collaborative effort in moving our field forward.
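The point that experimental error caps the achievable correlation can be illustrated with a minimal sketch. This is not the authors' derivation: it assumes independent, zero-mean Gaussian measurement error added to the true affinities, in which case even a perfect scoring function cannot exceed R = σ_true / sqrt(σ_true² + σ_error²). The function names and the spread and error values used below are hypothetical and are not taken from the CSAR data set.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def max_pearson_r(sd_true, sd_error):
    """Ceiling on R between a perfect score and noisy measurements.

    Assumes independent, zero-mean error with standard deviation sd_error
    added to true affinities with standard deviation sd_true.
    """
    return sd_true / math.sqrt(sd_true ** 2 + sd_error ** 2)


def simulate_ceiling(sd_true, sd_error, n=20000, seed=0):
    """Estimate the same ceiling by simulation with a 'perfect' score."""
    rng = random.Random(seed)
    # Hypothetical true pK_d values centred on 6 (illustrative only).
    true_pkd = [rng.gauss(6.0, sd_true) for _ in range(n)]
    # Measured values are the true values plus experimental noise.
    measured = [t + rng.gauss(0.0, sd_error) for t in true_pkd]
    # The score equals the true affinity, so any shortfall is due to noise.
    return pearson_r(true_pkd, measured)


if __name__ == "__main__":
    # Hypothetical spread and error, in pK_d units.
    sd_true, sd_error = 2.0, 1.0
    print("analytic ceiling: ", round(max_pearson_r(sd_true, sd_error), 3))
    print("simulated ceiling:", round(simulate_ceiling(sd_true, sd_error), 3))
```

Under these assumptions the ceiling depends only on the ratio of experimental error to the spread of affinities, so a data set that spans a wide affinity range tolerates more measurement noise before the attainable correlation drops.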