Wan Yue, Wu Jialu, Hou Tingjun, Hsieh Chang-Yu, Jia Xiaowei
University of Pittsburgh, Department of Computer Science, Pittsburgh, PA, 15260, USA.
Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, China.
Nat Commun. 2025 Jan 6;16(1):413. doi: 10.1038/s41467-024-55082-4.
Reliable molecular property prediction is essential for various scientific endeavors and industrial applications, such as drug discovery. However, data scarcity, combined with the highly non-linear causal relationships between physicochemical and biological properties on the one hand and conventional molecular featurization schemes on the other, complicates the development of robust molecular machine learning models. Self-supervised learning (SSL) has emerged as a popular solution, utilizing large-scale, unannotated molecular data to learn a foundational representation of chemical space that might be advantageous for downstream tasks. Yet existing molecular SSL methods largely overlook chemical knowledge, including molecular structure similarity, scaffold composition, and the context-dependent aspects of molecular properties, when operating over the chemical space. They also struggle to learn the subtle variations in structure-activity relationships. This paper introduces a multi-channel pre-training framework that learns robust and generalizable chemical knowledge. It leverages the structural hierarchy within the molecule, embeds it through distinct pre-training tasks across channels, and aggregates channel information in a task-specific manner during fine-tuning. Our approach demonstrates competitive performance across various molecular property benchmarks and offers strong advantages in particularly challenging yet ubiquitous scenarios like activity cliffs.
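To make the idea of task-specific aggregation concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each channel produces a fixed-length embedding for a molecule and that fine-tuning learns one scalar logit per channel, which a softmax turns into mixing weights. The function names, the three example channels, and the softmax-weighted sum are all illustrative assumptions; the abstract does not specify the aggregation mechanism.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_channels(channel_embeddings, task_logits):
    """Combine per-channel embeddings into one vector for a downstream task.

    channel_embeddings: list of equal-length float lists, one per channel.
    task_logits: one float per channel; in a real model these would be
        learned during fine-tuning (an assumption of this sketch).
    Returns (combined_embedding, channel_weights).
    """
    if len(channel_embeddings) != len(task_logits):
        raise ValueError("need exactly one logit per channel")
    dim = len(channel_embeddings[0])
    weights = softmax(task_logits)
    combined = [0.0] * dim
    for weight, embedding in zip(weights, channel_embeddings):
        if len(embedding) != dim:
            raise ValueError("all channel embeddings must share one dimension")
        for i, value in enumerate(embedding):
            combined[i] += weight * value
    return combined, weights


if __name__ == "__main__":
    # Hypothetical embeddings from three channels (for example, one per
    # level of the structural hierarchy); the values are made up.
    channels = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
    # Hypothetical task logits that favor the first channel.
    combined, weights = aggregate_channels(channels, [2.0, 0.0, 0.0])
    print("weights:", weights)
    print("combined:", combined)
```

Under these assumptions, a task that depends mostly on one kind of chemical knowledge would learn a large logit for the corresponding channel, while equal logits reduce the aggregation to a plain average of the channels.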