Property-driven localization and characterization in deep molecular representations.

Authors

Cintas Celia, Das Payel, Ross Jerret, Belgodere Brian, Tadesse Girmaw Abebe, Chenthamarakshan Vijil, Born Jannis, Speakman Skyler

Affiliations

IBM Research, Nairobi, Kenya.

IBM Research - Thomas J. Watson Research Center, Yorktown Heights, USA.

Publication information

Sci Rep. 2025 Aug 11;15(1):29365. doi: 10.1038/s41598-025-09717-1.

Abstract

Representation learning via pre-trained deep learning models is emerging as an integral method for studying the molecular structure-property relationship, which is then leveraged to predict molecular properties or design new molecules with desired attributes. We propose an unsupervised method to localize and characterize representations of pre-trained models through the lens of non-parametric property-driven subset scanning (PDSS), to improve the interpretability of deep molecular representations. We assess its detection capabilities on diverse molecular benchmarks (ZINC-250K, MOSES, MoleculeNet, FlavorDB, M2OR) across predictive chemical language models (MoLFormer, ChemBERTa) and molecular graph generative models (GraphAF, GCPN). We further study how representations evolve due to domain adaptation, and we evaluate the usefulness of the extracted property-driven elements in the embeddings as lower-dimension representations for downstream tasks. Experiments reveal notable information condensation in the pre-trained embeddings upon task-specific fine-tuning. For example, among the property-driven elements found in the embedding (out of [Formula: see text]), only 11 are shared between three distinct tasks (BACE, BBBP, and HIV), while [Formula: see text]-80 of those are unique to each task. Similar patterns are found for flavor and odor detection tasks. When we use the discovered property-driven elements as features for a new task, we find the same or improved performance (3 points up) while reducing the dimensions by 75% without fine-tuning required, thus indicating information localization.
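To make the idea of property-driven elements more concrete, the following is a minimal sketch of a non-parametric scan over embedding dimensions. It is not the authors' PDSS implementation, which the abstract does not specify. It assumes that each dimension is scored independently: activations for molecules with the property are converted to empirical p-values against a background set, and each dimension is scored with a Berk-Jones statistic. The function names, the `alpha` threshold, and the synthetic Gaussian data are all illustrative assumptions.

```python
import numpy as np


def empirical_pvalues(background, evaluation):
    """Right-tailed empirical p-values, computed per embedding dimension.

    background: (n_bg, d) activations of reference molecules.
    evaluation: (n_eval, d) activations of molecules with the property.
    """
    n_bg, n_dims = background.shape
    sorted_bg = np.sort(background, axis=0)
    pvals = np.empty(evaluation.shape, dtype=float)
    for j in range(n_dims):
        # Index of the first background value >= each evaluation value,
        # so n_bg - idx counts background activations at least as large.
        idx = np.searchsorted(sorted_bg[:, j], evaluation[:, j], side="left")
        pvals[:, j] = (n_bg - idx + 1) / (n_bg + 1)
    return pvals


def berk_jones_scores(pvals, alpha=0.05):
    """Score each dimension by its excess of p-values below alpha."""
    n = pvals.shape[0]
    frac = np.clip((pvals < alpha).mean(axis=0), 1e-12, 1 - 1e-12)
    kl = frac * np.log(frac / alpha) + (1 - frac) * np.log((1 - frac) / (1 - alpha))
    # Only dimensions with more significant p-values than expected score > 0.
    return np.where(frac > alpha, n * kl, 0.0)


# Synthetic stand-in for embeddings (assumption: no real model or dataset is used).
rng = np.random.default_rng(0)
n_dims = 16
shifted_dims = [2, 7, 11]  # hypothetical dimensions carrying the property signal
background = rng.normal(size=(500, n_dims))
evaluation = rng.normal(size=(200, n_dims))
evaluation[:, shifted_dims] += 2.0

pvals = empirical_pvalues(background, evaluation)
scores = berk_jones_scores(pvals)

# Keep the top-scoring dimensions as a lower-dimensional representation.
selected = np.sort(np.argsort(scores)[-len(shifted_dims):])
reduced = evaluation[:, selected]
print(selected, reduced.shape)
```

In this synthetic setup the shifted dimensions are expected to receive the highest scores, and `reduced` keeps only those columns. A full subset scan would also search over subsets of molecules jointly with subsets of dimensions, which this sketch omits for brevity.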

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f03f/12339718/d68270d610d1/41598_2025_9717_Fig1_HTML.jpg
