Harnik Yonatan, Shalit Peleg Hadas, Bermano Amit H, Milo Anat
Department of Chemistry, Ben-Gurion University of the Negev, Beer Sheva, Israel
School of Computer Science, Tel Aviv University, Tel Aviv, Israel
Chem Sci. 2025 May 22. doi: 10.1039/d5sc00907c.
Deep learning (DL) in chemistry has seen significant progress, yet its applicability is limited by the scarcity of large, labeled datasets and the difficulty of extracting meaningful molecular features. Molecular representation learning (MRL) has emerged as a powerful approach to address these challenges by decoupling feature extraction from property prediction. In MRL, a deep learning network is first trained to learn molecular features from large, unlabeled datasets and then fine-tuned for property prediction on smaller, specialized datasets. Although MRL methods have been widely applied across chemical applications, these models are typically trained from scratch. Herein, we propose that foundation models can serve as an advantageous starting point for developing MRL models. Foundation models are large models, trained on diverse datasets, that are capable of addressing a variety of downstream tasks. For example, large language models such as OpenAI's GPT-4 can be fine-tuned with minimal additional data for tasks considerably different from those they were trained on. Based on this premise, we leveraged OpenAI's vision foundation model, CLIP, as the backbone for developing MoleCLIP, a molecular image representation learning framework. MoleCLIP requires significantly less molecular pretraining data to match the performance of state-of-the-art models on standard benchmarks. Furthermore, MoleCLIP outperformed existing models on homogeneous catalysis datasets, emphasizing its robustness to distribution shifts, which allows it to adapt effectively to varied tasks and datasets. This successful application of a general foundation model to chemical tasks highlights the potential of innovations in DL research to advance synthetic chemistry and, more broadly, any field where molecular property description is central to discovery.
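To make the approach described in the abstract concrete, the sketch below illustrates the general idea of using a pretrained CLIP vision encoder as a backbone for molecular images: a molecule is rendered to a 2D depiction with RDKit, embedded with CLIP's image encoder, and passed to a small prediction head that would be fine-tuned on labeled property data. The checkpoint name, the rendering settings, and the head architecture are illustrative assumptions, not the authors' released MoleCLIP code or training procedure.

```python
# Minimal sketch: CLIP vision encoder as a molecular-image backbone,
# assuming the openai/clip-vit-base-patch32 checkpoint and a hypothetical
# regression head; MoleCLIP's actual pretraining is not reproduced here.
import torch
import torch.nn as nn
from rdkit import Chem
from rdkit.Chem import Draw
from transformers import CLIPImageProcessor, CLIPVisionModel

# Render a molecule to an RGB image using RDKit's default 2D depiction.
mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin, as an example input
image = Draw.MolToImage(mol, size=(224, 224))

# Load OpenAI's pretrained CLIP vision encoder as the backbone.
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")
backbone = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")

# Lightweight head for a downstream molecular property (hypothetical).
head = nn.Sequential(
    nn.Linear(backbone.config.hidden_size, 256),
    nn.ReLU(),
    nn.Linear(256, 1),
)

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    features = backbone(**inputs).pooler_output  # (1, hidden_size) image embedding
prediction = head(features)  # head (and optionally backbone) is fine-tuned on labeled data
print(prediction.shape)  # torch.Size([1, 1])
```

In this decoupled setup, the expensive representation learning is inherited from the foundation model, and only the small head (plus, optionally, the backbone) needs task-specific training, which mirrors the pretrain-then-fine-tune paradigm the abstract describes.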