Significance of Data Selection in Deep Learning for Reliable Binding Mode Prediction of Ligands in the Active Site of CYP3A4.

Author information

Sato Atsuko, Tanimura Naoki, Honma Teruki, Konagaya Akihiko

Affiliations

School of Computing, Department of Computer Science, Tokyo Institute of Technology.

Science Solutions Division, Mizuho Information & Research Institute, Inc.

Publication information

Chem Pharm Bull (Tokyo). 2019 Nov 1;67(11):1183-1190. doi: 10.1248/cpb.c19-00443. Epub 2019 Aug 17.

DOI:10.1248/cpb.c19-00443

PMID:31423003

Abstract

For rational drug design, it is essential to predict the binding mode of protein-ligand complexes. Although various machine learning-based models have been reported that use convolutional neural networks (deep learning) to predict binding modes from three-dimensional structures, there are few detailed reports on how best to construct and use datasets. Here, we examined how different datasets affected the prediction of the binding mode of CYP3A4 by a three-dimensional neural network when the number of crystal structures for the target protein was limited. We used four different training datasets: one large, general dataset containing various protein complexes and three smaller, more specific datasets containing complexes with CYP3A4-like pockets, complexes with CYP3A4-binding ligands, and complexes with CYP protein family members. We then trained models with different combinations of datasets, with or without subsequent fine-tuning, and evaluated the binding mode prediction performance of each model. The best model with respect to area under the receiver operating characteristic curve (ROC AUC) was obtained by training with a combination of the general protein and CYP family datasets. However, the model with the best balance between ROC AUC and recall was obtained by training with this combination of datasets followed by fine-tuning with the CYP3A4-binding ligands dataset. Our results suggest that datasets that balance protein functionality and data size are important for optimizing binding mode prediction performance. In addition, datasets with large median binding pocket sizes may be important for binding mode prediction specifically for CYP3A4.
