Schrödinger, Inc., 120 West 45th Street, New York, New York 10036, United States.
J Chem Inf Model. 2013 Sep 23;53(9):2312-21. doi: 10.1021/ci400250c. Epub 2013 Aug 19.
Numerous regression-based and machine learning techniques are available for the development of linear and nonlinear QSAR models that can accurately predict biological endpoints. Such tools can be quite powerful in the hands of an experienced modeler, but too frequently a disconnect remains between the modeler and project chemist because the resulting QSAR models are effectively black boxes. As a result, learning methods that yield models that can be visualized in the context of chemical structures are in high demand. In this work, we combine direct kernel-based PLS with Canvas 2D fingerprints to arrive at predictive QSAR models that can be projected onto the atoms of a chemical structure, allowing immediate identification of favorable and unfavorable characteristics. The method is validated using binding affinities for ligands from 10 different protein targets covering 7 distinct protein families. Models with significant predictive ability (test set Q² > 0.5) are obtained for 6 of 10 data sets, and fingerprints are shown to consistently outperform large collections of classical physicochemical and topological descriptors. In addition, we demonstrate how a simple bootstrapping technique may be employed to obtain uncertainties that provide meaningful estimates of prediction accuracy.
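The bootstrapping idea mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the authors' exact procedure, so the following shows one common approach only: resample the training set with replacement, refit on each resample, and take the spread of the resulting predictions as the uncertainty. A Tanimoto-weighted average of training activities stands in for the kernel-based PLS model, fingerprints are represented as sets of on-bit indices, and all function names and data values are hypothetical.

```python
import random
import statistics


def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def fit_predict(train, query):
    """Stand-in model (NOT kernel PLS): similarity-weighted mean of training activities."""
    weights = [tanimoto(fp, query) for fp, _ in train]
    total = sum(weights)
    if total == 0.0:
        # No training compound shares a bit with the query: fall back to the plain mean.
        return statistics.fmean(y for _, y in train)
    return sum(w * y for w, (_, y) in zip(weights, train)) / total


def bootstrap_prediction(train, query, n_boot=200, seed=0):
    """Return (mean, standard deviation) of predictions over bootstrap resamples."""
    rng = random.Random(seed)
    preds = []
    for _ in range(n_boot):
        # Draw a resample of the same size as the training set, with replacement.
        sample = [rng.choice(train) for _ in train]
        preds.append(fit_predict(sample, query))
    return statistics.fmean(preds), statistics.stdev(preds)


# Hypothetical toy data: (fingerprint on-bits, activity such as pKi).
train = [
    (frozenset({1, 2, 3, 4}), 7.2),
    (frozenset({1, 2, 5}), 6.1),
    (frozenset({2, 3, 4, 6}), 7.8),
    (frozenset({7, 8, 9}), 4.9),
    (frozenset({1, 3, 6, 9}), 6.6),
]
query = frozenset({1, 2, 3, 6})

mean, sd = bootstrap_prediction(train, query)
print(f"prediction = {mean:.2f} +/- {sd:.2f}")
```

A wide spread across resamples suggests the prediction depends heavily on a few training compounds, which is the kind of signal an uncertainty estimate is meant to surface.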