
Data efficient molecular image representation learning using foundation models.

Authors

Harnik Yonatan, Shalit Peleg Hadas, Bermano Amit H, Milo Anat

Affiliations

Department of Chemistry, Ben-Gurion University of the Negev, Beer Sheva, Israel

School of Computer Science, Tel Aviv University, Tel Aviv, Israel

Publication

Chem Sci. 2025 May 22. doi: 10.1039/d5sc00907c.

Abstract

Deep learning (DL) in chemistry has seen significant progress, yet its applicability is limited by the scarcity of large, labeled datasets and the difficulty of extracting meaningful molecular features. Molecular representation learning (MRL) has emerged as a powerful approach to address these challenges by decoupling feature extraction and property prediction. In MRL, a deep learning network is first trained to learn molecular features from large, unlabeled datasets and then finetuned for property prediction on smaller, specialized data. Whereas MRL methods have been widely applied across chemical applications, these models are typically trained from scratch. Herein, we propose that foundation models can serve as an advantageous starting point for developing MRL models. Foundation models are large models, trained on diverse datasets, that are capable of addressing various downstream tasks. For example, large language models like OpenAI's GPT-4 can be finetuned with minimal additional data for tasks considerably different from their training. Based on this premise, we leveraged OpenAI's vision foundation model, CLIP, as the backbone for developing MoleCLIP, a molecular image representation learning framework. MoleCLIP requires significantly less molecular pretraining data to match the performance of state-of-the-art models on standard benchmarks. Furthermore, MoleCLIP outperformed existing models on homogeneous catalysis datasets, emphasizing its robustness to distribution shifts, which allows it to adapt effectively to varied tasks and datasets. This successful application of a general foundation model to chemical tasks highlights the potential of innovations in DL research to advance synthetic chemistry and, more broadly, any field where molecular property description is central to discovery.
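The workflow the abstract describes — reuse a pretrained encoder for feature extraction and fit only a lightweight head on a small labeled dataset — can be sketched in miniature. This is a hypothetical illustration, not the authors' MoleCLIP code: the `frozen_encoder` below is a fixed random projection standing in for a pretrained vision backbone such as CLIP's image encoder, and the "finetuning" step is a simple ridge-regression head, all assumed names and choices for this sketch.

```python
# Minimal sketch of the decoupled MRL pattern: frozen pretrained
# features + a small trainable head for property prediction.
import numpy as np

rng = np.random.default_rng(0)

def frozen_encoder(images):
    """Stand-in for a frozen pretrained vision backbone (e.g. CLIP's
    image encoder): a fixed linear projection mapping each flattened
    image to a 16-dimensional feature vector."""
    proj = np.random.default_rng(42).normal(size=(images.shape[1], 16))
    return images @ proj

# Small "labeled" dataset: 100 flattened 64-pixel images and a
# scalar molecular property (both synthetic here).
X_img = rng.normal(size=(100, 64))
y = rng.normal(size=100)

# Step 1: feature extraction, decoupled from property prediction.
feats = frozen_encoder(X_img)

# Step 2: fit only a lightweight ridge-regression head on the
# small labeled set; the encoder weights are never updated.
lam = 1.0
A = feats.T @ feats + lam * np.eye(feats.shape[1])
w = np.linalg.solve(A, feats.T @ y)

preds = feats @ w
print("train MSE:", np.mean((preds - y) ** 2))
```

In practice, MoleCLIP-style approaches would also continue pretraining the backbone on molecular images and may finetune the encoder itself rather than freezing it; this sketch only shows the decoupling idea.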


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3fca/12100517/214758e11819/d5sc00907c-f1.jpg
