B2SLab, Departament d'Enginyeria de Sistemes, Automàtica i Informàtica Industrial, Universitat Politècnica de Catalunya, 08028 Barcelona, Spain.
Mind the Byte S.L., 08007 Barcelona, Spain.
J Chem Inf Model. 2019 Apr 22;59(4):1645-1657. doi: 10.1021/acs.jcim.8b00663. Epub 2019 Feb 22.
Binding prediction between targets and drug-like compounds through deep neural networks has generated promising results in recent years, outperforming traditional machine learning-based methods. However, the generalization capability of these classification models is still an issue to be addressed. In this work, we explored how different cross-validation strategies, applied to data from different molecular databases, affect the performance of proteochemometrics models for binding prediction. These strategies are (1) random splitting, (2) splitting based on K-means clustering (of both actives and inactives), (3) splitting based on source database, and (4) splitting based on both the clustering and the source database. These schemes are applied to a deep learning proteochemometrics model and to a simple logistic regression model used as a baseline. Additionally, two different ways of describing molecules in the model are tested: (1) by their SMILES and (2) by three fingerprints. The classification performance of our deep learning-based proteochemometrics model is comparable to the state of the art. Our results show that the lack of generalization of these models is due to a bias in public molecular databases and that a restrictive cross-validation scheme based on compound clustering leads to worse but more robust and credible results. Our results also show better performance when molecules are represented by their fingerprints.
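The four splitting strategies differ only in which samples are forced to share a fold. The following is a minimal sketch of that idea, not the authors' implementation: it assumes cluster labels (for example from K-means) and source-database labels have already been computed for each compound, and the function names `random_folds` and `grouped_folds`, the toy labels, and the greedy balancing rule are all hypothetical. In particular, treating strategy (4) as grouping by the (cluster, source) pair is an assumption, since the abstract does not specify how the two criteria are combined.

```python
import random
from collections import defaultdict


def random_folds(n_samples, n_folds, seed=0):
    """Strategy (1): assign each sample to a fold at random (balanced sizes)."""
    order = list(range(n_samples))
    random.Random(seed).shuffle(order)
    folds = [0] * n_samples
    for position, sample in enumerate(order):
        folds[sample] = position % n_folds
    return folds


def grouped_folds(groups, n_folds):
    """Strategies (2)-(4): all samples sharing a group label land in one fold.

    Groups are placed largest first into the currently smallest fold.
    This greedy balancing is an illustrative choice, not taken from the paper.
    """
    members = defaultdict(list)
    for sample, label in enumerate(groups):
        members[label].append(sample)

    sizes = [0] * n_folds
    folds = [None] * len(groups)
    for label in sorted(members, key=lambda g: (-len(members[g]), str(g))):
        target = sizes.index(min(sizes))  # smallest fold so far
        for sample in members[label]:
            folds[sample] = target
        sizes[target] += len(members[label])
    return folds


# Toy labels (hypothetical): one cluster id and one source database per compound.
clusters = [0, 0, 1, 1, 2, 2, 3, 3]
sources = ["dbA", "dbA", "dbA", "dbB", "dbB", "dbB", "dbA", "dbB"]

cluster_folds = grouped_folds(clusters, n_folds=2)                      # strategy (2)
source_folds = grouped_folds(sources, n_folds=2)                        # strategy (3)
combined_folds = grouped_folds(list(zip(clusters, sources)), n_folds=2)  # strategy (4), assumed pairing
```

Under a grouped split, compounds from the same cluster or database never appear in both training and test folds, which is why such schemes tend to give lower but more credible performance estimates than random splitting.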