Universite de Lille, Villeneuve d'Ascq cedex, France.
J Comput Aided Mol Des. 2021 May;35(5):657-665. doi: 10.1007/s10822-021-00383-9. Epub 2021 Apr 2.
Line notations of chemical structures are more compact than graphs and connection tables, which makes them useful for storing and transferring large numbers of molecular structures. The Simplified Molecular Input Line Entry System (SMILES) is the most widely used line notation, because it is much easier to use and understand than the alternatives and can be generated automatically from connection tables. A SMILES string encodes the structure of a molecule. An existing method, LINGO, uses SMILES to calculate molecular similarities and to predict structure-related properties. The LINGO method decomposes a canonical SMILES string into a set of four-character substrings referred to as LINGOs, and measures the similarity between a pair of molecules by comparing the LINGOs that occur in each. This paper introduces an alternative version of the LINGO method, called LINGO-DL, that uses LINGOs of different lengths. LINGO-DL fragments a canonical SMILES string into substrings of three different lengths, rather than the single length used in the LINGO method. Retrospective virtual screening experiments with the MDDR, DUD, and MUV datasets show that LINGO-DL outperforms the LINGO method, especially when the active molecules being sought have a high degree of structural heterogeneity.
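The fragmentation step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract specifies only that LINGO uses four-character substrings and that LINGO-DL uses three different lengths, so everything else here is an assumption: the function names, the lengths `(3, 4, 5)`, the multiset Tanimoto coefficient used as the similarity measure, and the pooling of LINGOs of all lengths into one multiset. The input strings are assumed to be canonical SMILES already, and no further preprocessing is applied.

```python
from collections import Counter


def lingos(smiles, q=4):
    """Return the multiset of all overlapping substrings of length q."""
    return Counter(smiles[i:i + q] for i in range(len(smiles) - q + 1))


def tanimoto(a, b):
    """Multiset Tanimoto coefficient (an assumed similarity measure)."""
    shared = sum((a & b).values())  # element-wise minimum of the counts
    total = sum((a | b).values())   # element-wise maximum of the counts
    return shared / total if total else 0.0


def lingo_similarity(smiles_a, smiles_b, q=4):
    """Single-length comparison, in the spirit of LINGO (q = 4)."""
    return tanimoto(lingos(smiles_a, q), lingos(smiles_b, q))


def lingo_dl_similarity(smiles_a, smiles_b, lengths=(3, 4, 5)):
    """Multi-length comparison, in the spirit of LINGO-DL.

    The lengths and the pooling strategy are hypothetical; the abstract
    does not state them.
    """
    pooled_a, pooled_b = Counter(), Counter()
    for q in lengths:
        pooled_a.update(lingos(smiles_a, q))
        pooled_b.update(lingos(smiles_b, q))
    return tanimoto(pooled_a, pooled_b)
```

For example, `lingos("c1ccccc1")` yields five four-character LINGOs, with `cccc` counted twice. Strings shorter than `q` produce no LINGOs, in which case the similarity is reported as 0.0.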