Ong Wern Juin Gabriel, Kirubakaran Palani, Karanicolas John
Cancer Signaling & Microenvironment Program, Fox Chase Cancer Center, Philadelphia, PA 19111.
Bowdoin College, Brunswick, ME 04011.
bioRxiv. 2023 Sep 6:2023.09.04.556234. doi: 10.1101/2023.09.04.556234.
The surge of interest in neural networks over the past decade has inspired many groups to deploy them for predicting the binding affinities of drug-like molecules to their receptors. A model that makes such predictions accurately could be used to screen large chemical libraries and help streamline the drug discovery process. However, despite reports of models that accurately predict quantitative inhibition from protein kinase sequences and inhibitors' SMILES strings, it remains unclear whether these models generalize to previously unseen data. Here, we build a Convolutional Neural Network (CNN) analogous to those previously reported and evaluate it on four datasets commonly used for inhibitor/kinase predictions. We find that the model performs comparably to those previously reported, provided that the individual data points are randomly split between the training set and the test set. However, model performance deteriorates dramatically when all data for a given inhibitor are placed together in the same training/testing fold, implying that information leakage underlies the models' performance. Through comparison to simple models in which the SMILES strings are tokenized, or in which test set predictions are simply copied from the closest training set data points, we demonstrate that there is essentially no generalization whatsoever in this model. In other words, the model has not learned anything about molecular interactions and provides no benefit over much simpler and more transparent models. These observations strongly point to the need for richer structure-based encodings to obtain useful prospective predictions for not-yet-synthesized candidate inhibitors.
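The difference between the two splitting schemes, and the "copy from the closest training point" baseline, can be illustrated with a minimal sketch. Everything below is hypothetical: the toy records, the function names, and the use of plain string similarity on SMILES as the notion of "closest" are assumptions made for illustration, not the datasets, code, or distance measure used in the paper.

```python
import random
from collections import defaultdict
from difflib import SequenceMatcher

# Hypothetical toy records: (inhibitor SMILES, kinase label, affinity).
# The affinity values are made up for illustration only.
RECORDS = [
    ("CCO", "KIN_A", 5.1),
    ("CCO", "KIN_B", 5.3),
    ("CCN", "KIN_A", 6.0),
    ("CCN", "KIN_B", 6.2),
    ("c1ccccc1", "KIN_A", 7.4),
    ("c1ccccc1", "KIN_B", 7.1),
    ("c1ccccc1O", "KIN_A", 7.8),
    ("c1ccccc1O", "KIN_B", 7.5),
]


def random_split(records, test_fraction, seed):
    """Split individual data points; one inhibitor may land in both sets."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def inhibitor_split(records, test_fraction, seed):
    """Keep all data for a given inhibitor on the same side of the split."""
    by_inhibitor = defaultdict(list)
    for record in records:
        by_inhibitor[record[0]].append(record)
    inhibitors = sorted(by_inhibitor)
    rng = random.Random(seed)
    rng.shuffle(inhibitors)
    n_test = max(1, round(len(inhibitors) * test_fraction))
    test_ids = set(inhibitors[:n_test])
    train = [r for r in records if r[0] not in test_ids]
    test = [r for r in records if r[0] in test_ids]
    return train, test


def shared_inhibitors(train, test):
    """Inhibitors that appear in both sets, i.e. a possible leakage route."""
    return {r[0] for r in train} & {r[0] for r in test}


def nearest_neighbor_predict(train, smiles, kinase):
    """Copy the label of the most similar training point.

    Assumption: similarity is a character-level ratio between SMILES
    strings, restricted to the same kinase when such points exist.
    """
    candidates = [r for r in train if r[1] == kinase] or train
    best = max(
        candidates,
        key=lambda r: SequenceMatcher(None, smiles, r[0]).ratio(),
    )
    return best[2]


if __name__ == "__main__":
    train, test = random_split(RECORDS, test_fraction=0.25, seed=0)
    print("random split, shared inhibitors:", shared_inhibitors(train, test))

    train, test = inhibitor_split(RECORDS, test_fraction=0.25, seed=0)
    print("inhibitor split, shared inhibitors:", shared_inhibitors(train, test))

    for smiles, kinase, measured in test:
        copied = nearest_neighbor_predict(train, smiles, kinase)
        print(smiles, kinase, "measured:", measured, "copied:", copied)
```

Under a random split, a test inhibitor is often also present in the training set with a different kinase, so a model can score well by recalling the inhibitor rather than by learning anything about the interaction. The inhibitor-level split removes that route by construction, which is why it is the more informative test of generalization to new compounds.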