Martin Eric J, Polyakov Valery R, Tian Li, Perez Rolando C
Novartis Institutes for Biomedical Research, 5300 Chiron Way, Emeryville, California 94608-2916, United States.
J Chem Inf Model. 2017 Aug 28;57(8):2077-2088. doi: 10.1021/acs.jcim.7b00166. Epub 2017 Jul 26.
While conventional random forest regression (RFR) virtual screening models appear to have excellent accuracy on random held-out test sets, they prove lacking in actual practice. Analysis of 18 historical virtual screens showed that random test sets are far more similar to their training sets than are the compounds project teams actually order. A new, cluster-based "realistic" training/test set split, which mirrors the chemical novelty of real-life virtual screens, recapitulates the poor predictive power of RFR models in real projects. The original Profile-QSAR (pQSAR) method greatly broadened the domain of applicability over conventional models by using as independent variables a profile of activity predictions from all historical assays in a large protein family. However, the accuracy still fell short of experiment on realistic test sets. The improved "pQSAR 2.0" method replaces probabilities of activity from naïve Bayes categorical models at several thresholds with predicted IC50s from RFR models. Unexpectedly, the high accuracy also requires removing the RFR model for the actual assay of interest from the independent variable profile. With these improvements, pQSAR 2.0 activity predictions are now statistically comparable to medium-throughput four-concentration IC50 measurements even on the realistic test set. Beyond the yes/no activity predictions from a typical high-throughput screen (HTS) or conventional virtual screen, these semiquantitative IC50 predictions allow for predicted potency, ligand efficiency, lipophilic efficiency, and selectivity against antitargets, greatly facilitating hitlist triaging and enabling virtual screening panels such as toxicity panels and overall promiscuity predictions.
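The cluster-based "realistic" split can be illustrated with a minimal sketch. The abstract does not give the clustering algorithm, similarity threshold, or split ratio, so everything below is an assumption made for illustration: fingerprints are plain sets of on-bits, clustering is a simple greedy leader scheme on Tanimoto similarity, and the training set is filled from the largest clusters so that the test set consists of small clusters and singletons, that is, compounds with few close analogues in training. The function names are hypothetical.

```python
from typing import Dict, FrozenSet, List, Tuple


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto similarity between two fingerprints stored as sets of on-bits."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def leader_clusters(fps: Dict[str, FrozenSet[int]], threshold: float) -> List[List[str]]:
    """Greedy leader clustering (an assumed stand-in for the paper's method).

    A compound joins the first cluster whose leader it resembles at or above
    `threshold`; otherwise it starts a new cluster.
    """
    clusters: List[List[str]] = []
    for name in sorted(fps):  # sorted for deterministic output
        for cluster in clusters:
            if tanimoto(fps[name], fps[cluster[0]]) >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters


def realistic_split(
    fps: Dict[str, FrozenSet[int]],
    threshold: float = 0.5,        # assumed value
    train_fraction: float = 0.75,  # assumed value
) -> Tuple[List[str], List[str]]:
    """Fill the training set from the largest clusters; the rest become test."""
    clusters = leader_clusters(fps, threshold)
    clusters.sort(key=len, reverse=True)  # stable sort, largest clusters first
    target = train_fraction * len(fps)
    train: List[str] = []
    test: List[str] = []
    for cluster in clusters:
        if len(train) < target:
            train.extend(cluster)
        else:
            test.extend(cluster)
    return train, test


# Toy data: three chemotypes of decreasing size.
toy_fps = {
    "a1": frozenset({1, 2, 3, 4}),
    "a2": frozenset({1, 2, 3, 5}),
    "a3": frozenset({1, 2, 3, 6}),
    "b1": frozenset({10, 11, 12}),
    "b2": frozenset({10, 11, 13}),
    "c1": frozenset({20, 21}),
}
train_ids, test_ids = realistic_split(toy_fps, threshold=0.5, train_fraction=0.5)
```

With these toy inputs the largest chemotype ends up in training and the two smaller ones in the test set, so no test compound has a close analogue in training. A random split of the same data would usually scatter each chemotype across both sets, which is the optimistic situation the abstract warns about.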
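The profile construction and the derived quantities mentioned in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: per-assay models are represented as hypothetical callables returning a predicted pIC50 (the negative base-10 logarithm of the molar IC50, used here for convenience), and the second-stage regression that maps the profile to activity is omitted because the abstract does not specify it. The efficiency formulas are the commonly used textbook definitions, not ones quoted from the abstract.

```python
from typing import Callable, Dict, List

# Hypothetical type: a per-assay model maps a compound ID to a predicted pIC50.
AssayModel = Callable[[str], float]


def build_profile(
    compound: str,
    models: Dict[str, AssayModel],
    assay_of_interest: str,
) -> List[float]:
    """Predicted pIC50s from every historical assay model EXCEPT the assay
    being modelled, following the leave-out rule described in the abstract."""
    return [
        models[assay](compound)
        for assay in sorted(models)  # fixed column order across compounds
        if assay != assay_of_interest
    ]


def ligand_efficiency(pic50: float, heavy_atoms: int) -> float:
    """Common approximation: LE ~ 1.37 * pIC50 / heavy-atom count
    (kcal/mol per heavy atom near room temperature)."""
    return 1.37 * pic50 / heavy_atoms


def lipophilic_efficiency(pic50: float, logp: float) -> float:
    """Common definition: LipE = pIC50 - logP."""
    return pic50 - logp


def selectivity(pic50_target: float, pic50_antitarget: float) -> float:
    """Selectivity in log units; positive favours the intended target."""
    return pic50_target - pic50_antitarget


# Toy stand-ins for trained per-assay RFR models (constant predictions).
toy_models: Dict[str, AssayModel] = {
    "kinase_A": lambda cid: 6.0,
    "kinase_B": lambda cid: 7.0,
    "kinase_C": lambda cid: 5.0,
}
profile_for_B = build_profile("cmpd-1", toy_models, assay_of_interest="kinase_B")
```

Here `profile_for_B` holds the predictions from `kinase_A` and `kinase_C` only. In a full workflow each compound's profile would become one row of the independent-variable matrix for the assay of interest, and the resulting semiquantitative predictions could then be passed to the efficiency and selectivity helpers for hitlist triage.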