School of Computer Science and Technology, Shanghai Frontiers Science Center of Molecule Intelligent Syntheses, East China Normal University, Shanghai, China.
Comput Biol Chem. 2024 Oct;112:108137. doi: 10.1016/j.compbiolchem.2024.108137. Epub 2024 Jul 25.
Compound-protein interaction (CPI) prediction plays a crucial role in drug discovery and drug repositioning. Early researchers relied on time-consuming and labor-intensive wet-laboratory experiments, but the advent of deep learning has significantly accelerated this process. Most existing deep learning methods use deep neural networks to extract compound features from sequences and graphs, either separately or in combination. Our team's previous research demonstrated that compound images contain valuable information that can be leveraged for the CPI task. However, few multimodal methods effectively combine the sequence and image representations of compounds for CPI. Contrastive language-image pre-training on text-image pairs is currently a popular approach in the multimodal field, and further research is needed to explore how integrating sequence and image representations can improve CPI prediction accuracy.
This paper presents MMCL-CPI, a novel method with two key contributions. 1) We extract compound features from two modalities, one-dimensional SMILES strings and two-dimensional images. This lets the model capture both sequence and spatial features, enhancing CPI prediction accuracy; on this basis, we design a novel multimodal model. 2) We introduce a multimodal pre-training strategy that applies contrastive learning to a large-scale unlabeled dataset to establish the correspondence between a compound's SMILES string and its image. This pre-training significantly improves compound feature representations for the downstream CPI task. Our method achieves competitive results on multiple datasets.
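The pre-training idea in contribution 2 can be illustrated with a minimal sketch of a symmetric, CLIP-style contrastive objective over paired SMILES and image embeddings. This is not the authors' implementation: the abstract does not specify the loss, the encoders, or any hyperparameters, so the function names, the temperature of 0.07, and the random vectors standing in for encoder outputs are all assumptions made for illustration.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def log_softmax(logits):
    # Numerically stable row-wise log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def symmetric_contrastive_loss(smiles_emb, image_emb, temperature=0.07):
    """Hypothetical CLIP-style loss: row i of each input is assumed to
    describe the same compound (SMILES embedding vs. image embedding)."""
    s = l2_normalize(smiles_emb)
    v = l2_normalize(image_emb)
    logits = s @ v.T / temperature          # (batch, batch) similarity matrix
    idx = np.arange(len(s))                 # matching pairs lie on the diagonal
    loss_s2i = -log_softmax(logits)[idx, idx].mean()    # SMILES -> image
    loss_i2s = -log_softmax(logits.T)[idx, idx].mean()  # image -> SMILES
    return 0.5 * (loss_s2i + loss_i2s)

# Random vectors stand in for encoder outputs (assumption, no real encoders here).
rng = np.random.default_rng(0)
smiles_emb = rng.normal(size=(8, 32))
aligned_img = smiles_emb + 0.01 * rng.normal(size=(8, 32))  # nearly matching pairs
random_img = rng.normal(size=(8, 32))                       # unrelated pairs

aligned_loss = symmetric_contrastive_loss(smiles_emb, aligned_img)
random_loss = symmetric_contrastive_loss(smiles_emb, random_img)
print(aligned_loss < random_loss)
```

In this sketch the loss is low when each SMILES embedding is closest to the image embedding of the same compound and high when the pairing carries no signal, which is the sense in which contrastive pre-training can establish a SMILES-to-image correspondence without labels.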