Nguyen Long D, Nguyen Quang H, Trinh Quang H, Nguyen Binh P
School of Information and Communication Technology, Hanoi University of Science and Technology, 1 Dai Co Viet, Hanoi 100000, Vietnam.
School of Mathematics and Statistics, Victoria University of Wellington, Kelburn Parade, Wellington 6012, New Zealand.
J Chem Inf Model. 2024 Dec 23;64(24):9173-9195. doi: 10.1021/acs.jcim.4c01240. Epub 2024 Dec 6.
We present a novel molecular property prediction framework that requires only SMILES strings as input but is multimodal by design, because it also incorporates predicted 3D conformer representations. The model captures comprehensive molecular features by leveraging both the sequential character structure of SMILES and the three-dimensional spatial structure of conformers. The framework employs contrastive learning: an InfoNCE loss aligns the SMILES and conformer embeddings, and task-specific losses are added on top, namely ConR for regression and SupCon for classification. To address data imbalance, a common challenge in drug discovery, we incorporate feature distribution smoothing (FDS). We evaluated the framework in multiple case studies, including SARS-CoV-2 drug docking score prediction, molecular property prediction on MoleculeNet data sets, and kinase inhibitor prediction for JAK-1, JAK-2, and MAPK-14 on custom data sets curated from PubChem. The framework consistently outperformed state-of-the-art methods, with ConR and FDS significantly improving regression tasks and SupCon enhancing classification performance. These findings highlight the flexibility and robustness of our multimodal model and demonstrate its effectiveness across diverse molecular property prediction tasks, with promising applications in drug discovery and molecular analysis.
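As an illustration of the alignment step only, the following is a minimal NumPy sketch of an InfoNCE loss over paired SMILES and conformer embeddings. It is not the authors' implementation: the function name `info_nce`, the symmetric (two-direction) form, the cosine-similarity scoring, and the temperature of 0.1 are all assumptions made for this example, and the embeddings are random stand-ins rather than encoder outputs.

```python
import numpy as np


def info_nce(smiles_emb, conformer_emb, temperature=0.1):
    """Symmetric InfoNCE loss for a batch of paired embeddings.

    Row i of `smiles_emb` and row i of `conformer_emb` are assumed to
    describe the same molecule (the positive pair); every other row in
    the batch serves as a negative. The name, the symmetric form, and
    the default temperature are illustrative assumptions.
    """
    # L2-normalise so that dot products are cosine similarities.
    s = smiles_emb / np.linalg.norm(smiles_emb, axis=1, keepdims=True)
    c = conformer_emb / np.linalg.norm(conformer_emb, axis=1, keepdims=True)
    logits = s @ c.T / temperature  # shape: (batch, batch)

    def cross_entropy_on_diagonal(z):
        # Numerically stable log-softmax over each row; the target
        # class for row i is column i (its own paired embedding).
        z = z - z.max(axis=1, keepdims=True)
        log_prob = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the SMILES-to-conformer and conformer-to-SMILES directions.
    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))


# Toy usage with random stand-in embeddings (8 molecules, 16 dimensions).
rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 16))
loss_aligned = info_nce(emb, emb)                          # pairs match
loss_misaligned = info_nce(emb, np.roll(emb, 1, axis=0))   # pairs shifted
print(loss_aligned < loss_misaligned)
```

In this toy setup the correctly paired batch yields the lower loss, which is the behaviour the contrastive objective rewards: embeddings of the same molecule from the two modalities are pulled together while embeddings of different molecules are pushed apart.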