Yoshimori Atsushi, Bajorath Jürgen
Institute for Theoretical Medicine, Inc., 26-1 Muraoka-Higashi 2-Chome, Fujisawa, Kanagawa, 251-0012, Japan.
Department of Life Science Informatics and Data Science, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, University of Bonn, Friedrich-Hirzebruch-Allee 5/6, 53115, Bonn, Germany.
J Cheminform. 2025 May 26;17(1):83. doi: 10.1186/s13321-025-01032-1.
Similarity searching is a mainstay in cheminformatics that is generally used to identify compounds with desired properties. For small molecular fragments, similarity calculations based on standard descriptors often have limited utility for establishing meaningful similarity relationships due to feature sparseness. As an alternative, we have adapted the concept of context-dependent word pair similarity from natural language processing to evaluate similarity relationships between substituents (R-groups), taking latent characteristics into account. Context-dependent similarity assessment is based on vector embeddings generated with neural networks as fragment representations. With active analogue series as a model system to establish a global structure-activity context, we demonstrate that this approach is applicable to systematic similarity searching for substituents and increases the performance of standard descriptor representations. Context-dependent similarity searching is capable of detecting remote and functionally relevant similarity relationships between substituents. Alternative search queries are introduced that focus on individual substituents within a global substituent context or on individual sequences of substituents establishing a local context. For similarity searching, different structural or structure-property contexts can be established, providing opportunities for various applications.
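The embedding-based search described in the abstract can be illustrated with a minimal sketch. It assumes that each substituent already has a vector embedding and that candidates are ranked by cosine similarity to a query; the abstract does not state the similarity measure, so cosine similarity is an assumption here. The function names, the substituent labels, and the three-dimensional toy vectors are hypothetical and are not taken from the study.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def rank_substituents(query, embeddings):
    """Rank all substituents except the query by similarity to the query.

    `embeddings` maps a substituent label to its embedding vector.
    Returns (label, similarity) pairs sorted from most to least similar.
    """
    q = embeddings[query]
    scores = [
        (name, cosine_similarity(q, vec))
        for name, vec in embeddings.items()
        if name != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy embeddings for illustration only (not data from the study).
# In practice these vectors would come from a trained neural network.
toy_embeddings = {
    "*F": np.array([0.9, 0.1, 0.0]),
    "*Cl": np.array([0.8, 0.2, 0.1]),
    "*OC": np.array([0.1, 0.9, 0.2]),
    "*C(F)(F)F": np.array([0.7, 0.0, 0.3]),
}

ranking = rank_substituents("*F", toy_embeddings)
for name, score in ranking:
    print(f"{name}\t{score:.3f}")
```

Because the ranking depends only on the embedding vectors, two substituents with little structural overlap can still score as similar if they occur in similar contexts, which is the kind of remote relationship the abstract refers to.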