Anhui Province Key Lab of Big Data Analysis and Application, University of Science and Technology of China, JinZhai Road, 230026, Anhui, China.
Tencent Quantum Laboratory, Tencent, Shennan Road, 518057, Guangdong, China.
Brief Bioinform. 2023 Nov 22;25(1). doi: 10.1093/bib/bbad451.
Protein-ligand binding affinity (PLBA) prediction is a fundamental task in drug discovery. Recently, various deep learning-based models have predicted binding affinity by taking the three-dimensional (3D) structure of protein-ligand complexes as input, and they have achieved remarkable progress. However, because high-quality training data are scarce, the generalization ability of current models is still limited. Although a vast amount of affinity data is available in large-scale databases such as ChEMBL, inconsistent affinity measurement labels (i.e. IC50, Ki, Kd), differing experimental conditions, and the lack of available 3D binding structures complicate the development of high-precision affinity prediction models from these data. To address these issues, we (i) propose Multi-task Bioassay Pre-training (MBP), a pre-training framework for structure-based PLBA prediction, and (ii) construct a pre-training dataset called ChEMBL-Dock with more than 300k experimentally measured affinity labels and about 2.8M docked 3D structures. MBP uses multi-task pre-training to treat the prediction of different affinity labels as different tasks, and it classifies the relative ranking between samples from the same bioassay. In this way, MBP learns robust and transferable structural knowledge from the new ChEMBL-Dock dataset despite its varied and noisy labels. Experiments substantiate the capability of MBP on the structure-based PLBA prediction task. To the best of our knowledge, MBP is the first affinity pre-training model, and it shows great potential for future development. The MBP web server is freely available at: https://huggingface.co/spaces/jiaxianustc/mbp.
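The idea of classifying relative rankings between samples from the same bioassay can be illustrated with a minimal sketch. The abstract does not specify the loss function or data layout, so everything below is an assumption made for illustration: the record format, the function names `same_assay_pairs` and `pairwise_ranking_loss`, the toy values, and the choice of a logistic (binary cross-entropy) pairwise loss. The per-label-type regression part of the multi-task setup is omitted.

```python
import math
from itertools import combinations

# Hypothetical records: (assay_id, label_type, measured_affinity, predicted_score).
# All values are invented for illustration; they are not from ChEMBL-Dock.
samples = [
    ("assay_A", "IC50", 6.2, 0.9),
    ("assay_A", "IC50", 7.5, 1.4),
    ("assay_A", "IC50", 5.1, 1.1),
    ("assay_B", "Ki", 8.0, 2.0),
    ("assay_B", "Ki", 6.5, 0.5),
]


def same_assay_pairs(records):
    """Return index pairs (i, j) that share an assay and have different labels.

    Restricting pairs to one assay means the two labels were measured under
    the same conditions, so only their relative order is used.
    """
    pairs = []
    for i, j in combinations(range(len(records)), 2):
        same_assay = records[i][0] == records[j][0]
        different_label = records[i][2] != records[j][2]
        if same_assay and different_label:
            pairs.append((i, j))
    return pairs


def pairwise_ranking_loss(records):
    """Mean binary cross-entropy on 'is sample i ranked above sample j?'.

    The predicted probability is sigmoid(score_i - score_j). The label_type
    field is unused here; in a multi-task model it could select which
    prediction head produces the score (an assumption, not the paper's design).
    """
    pairs = same_assay_pairs(records)
    if not pairs:
        return 0.0
    total = 0.0
    for i, j in pairs:
        target = 1.0 if records[i][2] > records[j][2] else 0.0
        prob = 1.0 / (1.0 + math.exp(-(records[i][3] - records[j][3])))
        total -= target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob)
    return total / len(pairs)


if __name__ == "__main__":
    print("pairs:", same_assay_pairs(samples))
    print("ranking loss:", pairwise_ranking_loss(samples))
```

Because pairs are never formed across assays, an IC50 value from one experiment is never compared directly with a Ki value from another, which is one plausible way to tolerate the label inconsistency described above.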