Department of Computer Science, University of Central Florida, Orlando, FL 32816, United States.
Genomics and Bioinformatics Cluster, University of Central Florida, Orlando, FL 32816, United States.
Bioinformatics. 2024 Sep 2;40(9). doi: 10.1093/bioinformatics/btae533.
The identification and understanding of drug-target interactions (DTIs) play a pivotal role in the drug discovery and development process. Sequence representations of drugs and proteins in computational models offer advantages such as widespread availability, easier input quality control, and reduced computational resource requirements. These advantages make them efficient and accessible tools for various computational biology and drug discovery applications. Many sequence-based DTI prediction methods have been developed over the years. Despite these methodological advances, cold start DTI prediction, in which the drug or protein is unknown to the model, remains a challenging task, particularly for sequence-based models. We introduce DTI-LM, a novel framework that leverages advanced pretrained language models, harnessing their exceptional context-capturing abilities along with neighborhood information to predict DTIs. DTI-LM is specifically designed to rely solely on sequence representations of drugs and proteins, aiming to bridge the gap between warm start and cold start predictions.
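To make the warm start versus cold start distinction concrete, the following is a minimal sketch of a protein cold start split, in which every protein in the test set is absent from the training set. The function name `cold_start_protein_split`, the toy SMILES strings, and the toy protein sequences are illustrative assumptions; the abstract does not describe the authors' splitting procedure at this level of detail.

```python
import random


def cold_start_protein_split(pairs, test_fraction=0.2, seed=0):
    """Split (drug, protein, label) triples so that test proteins are unseen in training.

    Hypothetical helper for illustration only; not taken from the DTI-LM code base.
    """
    proteins = sorted({protein for _, protein, _ in pairs})
    rng = random.Random(seed)
    rng.shuffle(proteins)
    # Hold out at least one protein so the test set is never empty.
    n_test = max(1, int(len(proteins) * test_fraction))
    held_out = set(proteins[:n_test])
    train = [t for t in pairs if t[1] not in held_out]
    test = [t for t in pairs if t[1] in held_out]
    return train, test


# Toy data: (drug SMILES, protein sequence, interaction label). All values are made up.
pairs = [
    ("CCO", "MKTAYIAK", 1),
    ("CCO", "MGSSHHHH", 0),
    ("CCN", "MKTAYIAK", 0),
    ("CCN", "MALWMRLL", 1),
    ("CCC", "MGSSHHHH", 1),
    ("CCC", "MALWMRLL", 0),
]
train, test = cold_start_protein_split(pairs, test_fraction=0.34)
train_proteins = {p for _, p, _ in train}
test_proteins = {p for _, p, _ in test}
print(train_proteins & test_proteins)  # empty set: no protein is shared
```

A drug cold start split would follow the same pattern with the drug column held out instead. A warm start split, by contrast, samples pairs at random, so the same drugs and proteins can appear on both sides.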
Large-scale experiments on four datasets show that DTI-LM achieves state-of-the-art performance on DTI prediction. Notably, it overcomes the common challenges that sequence-based models face in cold start predictions for proteins. The incorporation of neighborhood information through a graph attention network further enhances prediction accuracy. Nevertheless, a disparity persists between cold start predictions for proteins and drugs. A detailed examination of DTI-LM reveals that language models exhibit contrasting capabilities in capturing similarities between drugs and proteins.
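As a rough illustration of how a graph attention network can fold neighborhood information into a node embedding, the sketch below implements one single-head attention layer in plain Python, following the standard graph attention formulation: project each node with a weight matrix, score each neighbor with a LeakyReLU over the concatenated projections, normalize the scores with a softmax, and take the weighted sum. The names `gat_layer`, `W`, and `a`, the toy embeddings, and the neighbor lists are assumptions for illustration; this is not the DTI-LM architecture or its configuration, and the output nonlinearity is omitted for brevity.

```python
import math


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def matvec(W, h):
    """Multiply matrix W (list of rows) by vector h."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def gat_layer(features, neighbors, W, a):
    """One single-head graph attention layer.

    features:  dict node -> input embedding (list of floats)
    neighbors: dict node -> list of neighbor nodes (include the node itself for a self-loop)
    W:         projection matrix, shape (out_dim, in_dim)
    a:         attention vector of length 2 * out_dim
    """
    projected = {node: matvec(W, h) for node, h in features.items()}
    output = {}
    for i, nbrs in neighbors.items():
        # Unnormalized score for each neighbor j: LeakyReLU(a . [W h_i || W h_j]).
        scores = [
            leaky_relu(sum(w * x for w, x in zip(a, projected[i] + projected[j])))
            for j in nbrs
        ]
        # Numerically stable softmax over the neighborhood.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        alphas = [e / total for e in exps]
        # Attention-weighted sum of the projected neighbor embeddings.
        dim = len(projected[i])
        output[i] = [
            sum(alpha * projected[j][k] for alpha, j in zip(alphas, nbrs))
            for k in range(dim)
        ]
    return output


# Toy protein embeddings (stand-ins for language model outputs) and a toy neighbor graph.
features = {"p1": [1.0, 0.0], "p2": [0.0, 1.0], "p3": [1.0, 1.0]}
neighbors = {"p1": ["p1", "p2"], "p2": ["p2", "p3"], "p3": ["p3"]}
W = [[1.0, 0.0], [0.0, 1.0]]   # identity projection, for readability
a = [0.5, -0.5, 1.0, 0.0]      # arbitrary attention parameters
updated = gat_layer(features, neighbors, W, a)
print(updated["p1"])
```

In a trained model, `W` and `a` would be learned, and the neighbor lists would come from some similarity or interaction graph; how DTI-LM defines that neighborhood is not specified in the abstract.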
Source code is available at: https://github.com/compbiolabucf/DTI-LM.