Cramer Richard D, Cruz Phillip, Stahl Gunther, Curtiss William C, Campbell Brian, Masek Brian B, Soltanshahi Farhad
Tripos International, 1699 South Hanley Road, St. Louis, Missouri 63144, USA.
J Chem Inf Model. 2008 Nov;48(11):2180-95. doi: 10.1021/ci8001556.
Multiple R-groups (monovalent fragments) are implicitly accessible within most of the molecular structures that populate large structural databases. R-group searching would desirably consider pIC50 contribution forecasts as well as ligand similarities or docking scores. However, R-group searching, with or without pIC50 forecasts, is currently not practical. The most prevalent and reliable source of pIC50 predictions, existing 3D-QSAR approaches, is also difficult and somewhat subjective. Yet in 25 of 25 trials on data sets on which a field-based 3D-QSAR treatment had already succeeded, substitution of objective (canonically generated) topomer poses for the original structure-guided manual alignments produced acceptable 3D-QSAR models, on average having almost equivalent statistical quality to the published models, and with negligible effort. Their overall pIC50 prediction error is 0.805, calculated as the average, over these 25 topomer CoMFA models, of the standard deviations of pIC50 predictions derived from the 1109 possible "leave-out-one-R-group" (LOORG) pIC50 contributions. (This novel LOORG protocol provides a more realistic and stringent test of prediction accuracy than the customary "leave-out-one-compound" LOO approach.) The associated average predictive r² of 0.495 indicates a pIC50 prediction accuracy roughly halfway between perfect and useless. To assess the ability of topomer-CoMFA based virtual screening to identify "highly active" R-groups, a receiver operating characteristic (ROC) curve approach was adopted. Using, as the binary criterion for a "highly active" R-group, a predicted pIC50 in the top 25% of the observed pIC50 range, the ROC area averaged across the 25 topomer CoMFA models is 0.729. Conventionally interpreted, the odds that a "highly active" R-group will indeed confer such a high pIC50 are 0.729/(1-0.729) or almost 3 to 1.
To confirm that virtual screening within large collections of realized structures would provide a useful quantity and variety of R-group suggestions, combining shape similarity with the "highly active" pIC50, the 50 searches provided by these 25 models were applied to 2.2 million structurally distinct R-group candidates among 2.0 million structures within a ZINC database, identifying an average of 5705 R-groups per search, with the highest predicted pIC50 combination averaging 1.6 log units greater than the highest reported pIC50s.
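The summary statistics quoted in the abstract can be reproduced with a few lines of arithmetic. The Python sketch below is illustrative only and is not the authors' code. The function names are hypothetical, the predictive r² is assumed to follow the usual cross-validation form 1 - PRESS/SS (the abstract does not give the exact formula), and the ROC area is computed with the standard pairwise-ranking (Mann-Whitney) definition.

```python
def auc_to_odds(auc):
    """Convert an ROC area to odds, auc / (1 - auc), as done in the abstract."""
    if not 0.0 <= auc < 1.0:
        raise ValueError("auc must be in [0, 1)")
    return auc / (1.0 - auc)


def predictive_r2(observed, predicted):
    """Assumed form: 1 - PRESS / SS, with SS taken about the observed mean."""
    mean_obs = sum(observed) / len(observed)
    press = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_total = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - press / ss_total


def roc_auc(is_active, scores):
    """Fraction of (active, inactive) pairs ranked correctly; ties count half."""
    pos = [s for a, s in zip(is_active, scores) if a]
    neg = [s for a, s in zip(is_active, scores) if not a]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Figures taken from the abstract.
odds = auc_to_odds(0.729)             # ROC area averaged over 25 models
hit_fraction = 5705 / 2_200_000       # mean R-groups found per search / candidates

print(f"odds = {odds:.2f} to 1")
print(f"hit fraction = {hit_fraction:.2%}")
```

With the abstract's ROC area of 0.729 the odds come to about 2.69 to 1, which is the "almost 3 to 1" quoted above, and an average of 5705 hits among 2.2 million candidates corresponds to roughly 0.26% of the R-group pool per search.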