Obrezanova Olga, Gola Joelle M R, Champness Edmund J, Segall Matthew D
BioFocus DPI Ltd., Darwin Building, Chesterford Research Park, Saffron Walden, CB10 1XL, UK.
J Comput Aided Mol Des. 2008 Jun-Jul;22(6-7):431-40. doi: 10.1007/s10822-008-9193-8. Epub 2008 Feb 14.
In this article, we present an automatic model generation process for building QSAR models using Gaussian Processes, a powerful machine learning modeling method. We describe the stages of the process that ensure models are built and validated within a rigorous framework: descriptor calculation; splitting data into training, validation and test sets; descriptor filtering; application of modeling techniques; and selection of the best model. We apply this automatic process to data sets of blood-brain barrier penetration and aqueous solubility, and compare the resulting automatically generated models with 'manually' built models using external test sets. The results demonstrate the effectiveness of the automatic model generation process for two types of data sets commonly encountered in building ADME QSAR models: a small set of in vivo data and a large set of physicochemical data.
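The stages named in the abstract (data splitting, descriptor filtering, model fitting, and selection of the best model) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The 70/15/15 split, the variance and correlation thresholds, the RBF kernel, the small grid of hyperparameters, and all function names (`split_data`, `filter_descriptors`, `gp_fit_predict`, `auto_model`) are assumptions made for illustration. Descriptor calculation is assumed to have happened already, so `X` is a ready-made descriptor matrix.

```python
import numpy as np

def split_data(n, rng, frac=(0.7, 0.15)):
    """Randomly split n row indices into training, validation and test sets."""
    idx = rng.permutation(n)
    n_train = int(round(frac[0] * n))
    n_val = int(round(frac[1] * n))
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

def filter_descriptors(X_train, min_std=1e-8, max_corr=0.95):
    """Keep columns that are not near-constant and not highly correlated
    with a column that has already been kept (assumed thresholds)."""
    keep = []
    for j in range(X_train.shape[1]):
        col = X_train[:, j]
        if col.std() < min_std:
            continue
        if any(abs(np.corrcoef(col, X_train[:, k])[0, 1]) > max_corr for k in keep):
            continue
        keep.append(j)
    return keep

def rbf_kernel(A, B, length_scale):
    """Squared-exponential covariance between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-0.5 * d2 / length_scale ** 2)

def gp_fit_predict(X_train, y_train, X_new, length_scale, noise):
    """Posterior mean of a Gaussian Process with a constant prior mean."""
    y_mean = y_train.mean()
    K = rbf_kernel(X_train, X_train, length_scale) + noise * np.eye(len(X_train))
    alpha = np.linalg.solve(K, y_train - y_mean)
    return y_mean + rbf_kernel(X_new, X_train, length_scale) @ alpha

def rmse(y, y_hat):
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

def auto_model(X, y, seed=0):
    """Run the stages in order and report the model chosen on the validation set."""
    rng = np.random.default_rng(seed)
    tr, va, te = split_data(len(y), rng)
    keep = filter_descriptors(X[tr])
    # Standardise with training-set statistics only, to avoid leaking test data.
    mu, sd = X[tr][:, keep].mean(axis=0), X[tr][:, keep].std(axis=0)
    Z = (X[:, keep] - mu) / sd
    best = None
    for length_scale in (0.5, 1.0, 2.0, 4.0):
        for noise in (0.01, 0.1):
            err = rmse(y[va], gp_fit_predict(Z[tr], y[tr], Z[va], length_scale, noise))
            if best is None or err < best[0]:
                best = (err, length_scale, noise)
    val_err, length_scale, noise = best
    # The test set is used once, after model selection is finished.
    test_err = rmse(y[te], gp_fit_predict(Z[tr], y[tr], Z[te], length_scale, noise))
    return {"kept": keep, "length_scale": length_scale, "noise": noise,
            "val_rmse": val_err, "test_rmse": test_err}

if __name__ == "__main__":
    # Synthetic stand-in for a descriptor matrix and a measured property.
    rng = np.random.default_rng(1)
    X = rng.normal(size=(120, 4))
    y = np.sin(X[:, 0]) + 0.5 * X[:, 1]
    print(auto_model(X, y))
```

The key design point in this sketch is that the validation set alone decides which candidate model is kept, while the test set is touched only once to estimate how the chosen model performs on unseen compounds. A grid search stands in here for whatever hyperparameter treatment the Gaussian Process techniques in the paper use.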