
Data efficient molecular image representation learning using foundation models.

Authors

Harnik Yonatan, Shalit Peleg Hadas, Bermano Amit H, Milo Anat

Affiliations

Department of Chemistry, Ben-Gurion University of the Negev, Beer Sheva, Israel

School of Computer Science, Tel Aviv University, Tel Aviv, Israel

Publication

Chem Sci. 2025 May 22. doi: 10.1039/d5sc00907c.

Abstract

Deep learning (DL) in chemistry has seen significant progress, yet its applicability is limited by the scarcity of large, labeled datasets and the difficulty of extracting meaningful molecular features. Molecular representation learning (MRL) has emerged as a powerful approach to address these challenges by decoupling feature extraction and property prediction. In MRL, a deep learning network is first trained to learn molecular features from large, unlabeled datasets and then finetuned for property prediction on smaller, specialized data. Whereas MRL methods have been widely applied across chemical applications, these models are typically trained from scratch. Herein, we propose that foundation models can serve as an advantageous starting point for developing MRL models. Foundation models are large models, trained on diverse datasets, that are capable of addressing various downstream tasks. For example, large language models like OpenAI's GPT-4 can be finetuned with minimal additional data for tasks considerably different from their training. Based on this premise, we leveraged OpenAI's vision foundation model, CLIP, as the backbone for developing MoleCLIP, a molecular image representation learning framework. MoleCLIP requires significantly less molecular pretraining data to match the performance of state-of-the-art models on standard benchmarks. Furthermore, MoleCLIP outperformed existing models on homogeneous catalysis datasets, emphasizing its robustness to distribution shifts, which allows it to adapt effectively to varied tasks and datasets. This successful application of a general foundation model to chemical tasks highlights the potential of innovations in DL research to advance synthetic chemistry and, more broadly, any field where molecular property description is central to discovery.
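The workflow the abstract describes — reuse a pretrained encoder for feature extraction and fit only a lightweight head on a small labeled dataset — can be sketched in miniature. This is a hypothetical illustration, not the authors' MoleCLIP code: the `frozen_encoder` below is a fixed random projection standing in for a pretrained vision backbone such as CLIP's image encoder, and the "finetuning" step is a simple ridge-regression head, all assumed names and choices for this sketch.

```python
# Minimal sketch of the decoupled MRL pattern: frozen pretrained
# features + a small trainable head for property prediction.
import numpy as np

rng = np.random.default_rng(0)

def frozen_encoder(images):
    """Stand-in for a frozen pretrained vision backbone (e.g. CLIP's
    image encoder): a fixed linear projection mapping each flattened
    image to a 16-dimensional feature vector."""
    proj = np.random.default_rng(42).normal(size=(images.shape[1], 16))
    return images @ proj

# Small "labeled" dataset: 100 flattened 64-pixel images and a
# scalar molecular property (both synthetic here).
X_img = rng.normal(size=(100, 64))
y = rng.normal(size=100)

# Step 1: feature extraction, decoupled from property prediction.
feats = frozen_encoder(X_img)

# Step 2: fit only a lightweight ridge-regression head on the
# small labeled set; the encoder weights are never updated.
lam = 1.0
A = feats.T @ feats + lam * np.eye(feats.shape[1])
w = np.linalg.solve(A, feats.T @ y)

preds = feats @ w
print("train MSE:", np.mean((preds - y) ** 2))
```

In practice, MoleCLIP-style approaches would also continue pretraining the backbone on molecular images and may finetune the encoder itself rather than freezing it; this sketch only shows the decoupling idea.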


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3fca/12100517/214758e11819/d5sc00907c-f1.jpg
