Shanghai Key Lab of Intelligent Information Processing, and School of Computer Science, Fudan University, Shanghai 200438, China.
Tencent AI Lab, Shenzhen 518063, China.
J Chem Inf Model. 2024 Apr 8;64(7):2921-2930. doi: 10.1021/acs.jcim.3c01707. Epub 2023 Dec 25.
Self-supervised pretrained models are gaining increasing popularity in AI-aided drug discovery, leading to more and more pretrained models with the promise that they can extract better feature representations for molecules. Yet, the quality of the learned representations has not been fully explored. In this work, inspired by the two phenomena of Activity Cliffs (ACs) and Scaffold Hopping (SH) in traditional Quantitative Structure-Activity Relationship analysis, we propose a method named Representation-Property Relationship Analysis (RePRA) to evaluate the quality of the representations extracted by a pretrained model and to visualize the relationship between the representations and properties. The concepts of ACs and SH are generalized from the structure-activity context to the representation-property context, and the underlying principles of RePRA are analyzed theoretically. Two scores are designed to measure the generalized ACs and SH detected by RePRA, so that the quality of representations can be evaluated. In experiments, representations of molecules from 10 target tasks generated by 7 pretrained models are analyzed. The results indicate that the state-of-the-art pretrained models can overcome some shortcomings of canonical Extended-Connectivity FingerPrints, while the correlation between the basis of the representation space and specific molecular substructures is not explicit. Thus, some representations could be even worse than the canonical fingerprints. Our method enables researchers to evaluate the quality of molecular representations generated by their proposed self-supervised pretrained models, and our findings can guide the community to develop better pretraining techniques that regularize the occurrence of ACs and SH.
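The core idea, flagging molecule pairs whose representations are very similar but whose properties differ sharply (generalized ACs), or whose representations are dissimilar despite near-identical properties (generalized SH), can be sketched as follows. This is a minimal illustrative sketch, not the authors' RePRA implementation: the function name `detect_ac_sh`, the use of cosine similarity, and all threshold values are assumptions chosen for the example; the paper defines its own scores.

```python
import numpy as np

def detect_ac_sh(reps, props, sim_hi=0.9, sim_lo=0.3, dp_hi=1.0, dp_lo=0.1):
    """Flag generalized Activity Cliffs (ACs) and Scaffold Hops (SH)
    among molecule pairs, comparing cosine similarity of the learned
    representations against absolute property differences.
    Thresholds are illustrative, not the paper's actual criteria."""
    reps = np.asarray(reps, dtype=float)
    unit = reps / np.linalg.norm(reps, axis=1, keepdims=True)
    sim = unit @ unit.T                      # pairwise cosine similarity
    acs, shs = [], []
    n = len(reps)
    for i in range(n):
        for j in range(i + 1, n):
            dp = abs(props[i] - props[j])    # property gap for the pair
            if sim[i, j] >= sim_hi and dp >= dp_hi:
                acs.append((i, j))           # similar reps, very different property
            elif sim[i, j] <= sim_lo and dp <= dp_lo:
                shs.append((i, j))           # dissimilar reps, similar property
    return acs, shs
```

A representation with many generalized ACs maps chemically decisive differences to nearby points, while many generalized SH pairs indicate redundant, scattered encodings of the same property.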