Timon Schroeter, Anton Schwaighofer, Sebastian Mika, Antonius Ter Laak, Detlev Suelzle, Ursula Ganzer, Nikolaus Heinrich, Klaus-Robert Müller
Fraunhofer FIRST, Kekuléstrasse 7, 12489 Berlin, Germany.
Mol Pharm. 2007 Jul-Aug;4(4):524-38. doi: 10.1021/mp0700413. Epub 2007 Jul 19.
Unfavorable lipophilicity and water solubility cause many drug failures, so these properties have to be taken into account early in lead discovery. Commercial tools for predicting lipophilicity have usually been trained on small, neutral molecules and are therefore often unable to predict in-house data accurately. Using a modern Bayesian machine learning algorithm, a Gaussian process model, this study constructs a log D7 model based on 14,556 drug discovery compounds of Bayer Schering Pharma. Performance is compared with support vector machines, decision trees, ridge regression, and four commercial tools. In a blind test on 7,013 new measurements from recent months (including compounds from new projects), 81% were predicted correctly within 1 log unit, compared with only 44% achieved by commercial software. Additional evaluations using public data are presented. We consider error bars for each method (model-based, ensemble-based, and distance-based approaches) and investigate how well they quantify the domain of applicability of each model.
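The two evaluation ideas in the abstract, the share of compounds predicted within 1 log unit and the question of whether predicted error bars are informative, can be made concrete with a short sketch. The Python below is illustrative only: the function names, the toy numbers, and the use of a Gaussian reference coverage are assumptions made for this example, not the study's evaluation code or data.

```python
from math import erf, sqrt


def fraction_within(predicted, measured, tol=1.0):
    """Share of compounds whose absolute error is at most `tol` log units."""
    if len(predicted) != len(measured) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    hits = sum(abs(p - m) <= tol for p, m in zip(predicted, measured))
    return hits / len(predicted)


def error_bar_coverage(predicted, sigma, measured, k=1.0):
    """Share of measurements that fall inside predicted +/- k * sigma."""
    if not (len(predicted) == len(sigma) == len(measured)) or not predicted:
        raise ValueError("need three non-empty sequences of equal length")
    hits = sum(abs(p - m) <= k * s
               for p, s, m in zip(predicted, sigma, measured))
    return hits / len(predicted)


def nominal_coverage(k=1.0):
    """Coverage expected within +/- k sigma if errors were Gaussian (assumption)."""
    return erf(k / sqrt(2.0))


# Toy values invented for illustration; NOT data from the study.
measured = [1.2, 3.4, 0.5, 2.8, 4.1]    # hypothetical measured log D values
predicted = [1.0, 2.1, 0.9, 2.7, 5.6]   # hypothetical model predictions
sigma = [0.3, 0.8, 0.5, 0.2, 1.0]       # hypothetical predicted error bars

print("within 1 log unit:", fraction_within(predicted, measured))
print("inside 1-sigma error bars:", error_bar_coverage(predicted, sigma, measured))
print("Gaussian reference for 1 sigma:", round(nominal_coverage(1.0), 3))
```

If the observed coverage sits well below the reference value, the error bars are too narrow for the compounds in question. That is one simple way to read "how well they quantify the domain of applicability", though the study may use different diagnostics.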