

Does protein pretrained language model facilitate the prediction of protein-ligand interaction?

Affiliations

Guangdong-Hong Kong-Macao Joint Laboratory of Human-Machine Intelligence-Synergy Systems, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China.

Guangdong-Hong Kong-Macao Joint Laboratory of Human-Machine Intelligence-Synergy Systems, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China.

Publication Information

Methods. 2023 Nov;219:8-15. doi: 10.1016/j.ymeth.2023.08.016. Epub 2023 Sep 9.

Abstract

Protein-ligand interaction (PLI) is a critical step for drug discovery. Recently, protein pretrained language models (PLMs) have showcased exceptional performance across a wide range of protein-related tasks. However, a significant heterogeneity exists between the PLM and PLI tasks, leading to a degree of uncertainty. In this study, we propose a method that quantitatively assesses the significance of protein PLMs in PLI prediction. Specifically, we analyze the performance of three widely used protein PLMs (TAPE, ESM-1b, and ProtTrans) on three PLI tasks (PDBbind, Kinase, and DUD-E). The model with pre-training consistently achieves improved performance and decreased time cost, demonstrating that pre-training enhances both the accuracy and efficiency of PLI prediction. By quantitatively assessing transferability, the optimal PLM for each PLI task is identified without the need for costly transfer experiments. Additionally, we examine the contributions of PLMs to the distribution of the feature space, highlighting the improved discriminability after pre-training. Our findings provide insights into the mechanisms underlying PLMs in PLI prediction and pave the way for the design of more interpretable and accurate PLMs in the future. Code and data are freely available at https://github.com/brian-zZZ/PLM-PLI.
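The general pipeline the abstract describes — feeding a frozen PLM's protein embeddings, together with a ligand representation, into a downstream predictor — can be sketched as below. The embedding dimensions, the mean-pooling step, the binary fingerprint, and the small MLP head are all illustrative assumptions for this sketch, not the paper's actual architecture; in practice the residue embeddings would come from TAPE, ESM-1b, or ProtTrans, and the fingerprint from a cheminformatics toolkit.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for real features (assumption: random placeholders).
# In practice, per-residue embeddings come from a frozen PLM such as ESM-1b,
# and the ligand is encoded, e.g., as a binary molecular fingerprint.
protein_residue_emb = rng.normal(size=(350, 1280))          # L residues x d_model
ligand_fp = rng.integers(0, 2, size=2048).astype(float)     # binary fingerprint

# Mean-pool residue embeddings into one fixed-size protein vector.
protein_vec = protein_residue_emb.mean(axis=0)              # shape (1280,)

# Concatenate protein and ligand features, then score the pair with a
# tiny randomly initialized MLP head (illustrative only).
x = np.concatenate([protein_vec, ligand_fp])                # shape (3328,)
W1 = rng.normal(scale=0.02, size=(3328, 128))
b1 = np.zeros(128)
W2 = rng.normal(scale=0.02, size=(128, 1))
b2 = np.zeros(1)

h = np.maximum(x @ W1 + b1, 0.0)                            # ReLU hidden layer
score = float(1.0 / (1.0 + np.exp(-(h @ W2 + b2))))         # sigmoid probability
print(f"interaction score: {score:.3f}")
```

Because the PLM is frozen here, only the small head would need training for each PLI task, which is consistent with the reduced time cost the abstract reports.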

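The abstract's claim of "improved discriminability after pre-training" can be illustrated with a simple between-class / within-class distance ratio computed on embeddings: higher ratios indicate a more separable feature space. The metric below and the simulated embeddings are assumptions for illustration only, not the analysis used in the paper.

```python
import numpy as np

def discriminability(X, y):
    """Ratio of mean between-class to mean within-class squared distance.
    Higher values indicate a more linearly separable feature space."""
    classes = np.unique(y)
    centroids = np.stack([X[y == c].mean(axis=0) for c in classes])
    within = np.mean([((X[y == c] - centroids[i]) ** 2).sum(axis=1).mean()
                      for i, c in enumerate(classes)])
    overall = X.mean(axis=0)
    between = ((centroids - overall) ** 2).sum(axis=1).mean()
    return between / within

rng = np.random.default_rng(1)

# Simulated two-class embeddings (assumption): the "pretrained" features are
# drawn with a larger class separation than the "from-scratch" ones.
y = np.repeat([0, 1], 100)
scratch = rng.normal(size=(200, 64)) + y[:, None] * 0.2
pretrained = rng.normal(size=(200, 64)) + y[:, None] * 2.0

print(f"scratch:    {discriminability(scratch, y):.4f}")
print(f"pretrained: {discriminability(pretrained, y):.4f}")
```

Applied to real embeddings of active versus inactive protein-ligand pairs, a rise in such a ratio after pre-training would mirror the improved class separation the abstract highlights.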
