O'Hagan Steve, Kell Douglas B
School of Chemistry, The University of Manchester, Manchester, UK; The Manchester Institute of Biotechnology, The University of Manchester, Manchester, UK; Manchester Centre for Synthetic Biology of Fine and Speciality Chemicals, The University of Manchester, Manchester, UK.
Front Pharmacol. 2016 Aug 22;7:266. doi: 10.3389/fphar.2016.00266. eCollection 2016.
Previous studies compared the molecular similarity of marketed drugs and endogenous human metabolites (endogenites), using a series of fingerprint-type encodings, variously ranked and clustered using the Tanimoto (Jaccard) similarity coefficient (TS). Because this gives equal weight to all parts of the encoding (and thence to different substructures in the molecule), it may not be optimal, since in many cases not all parts of the molecule will bind to their macromolecular targets. Unsupervised methods alone cannot uncover this. Here we explore the kinds of differences that may be observed when the TS is replaced, in a manner more equivalent to semi-supervised learning, by variants of the asymmetric Tversky (TV) similarity, which includes α and β parameters.
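As a minimal sketch of the measure being discussed, the following assumes the standard textbook form of the Tversky index, TV(A, B) = |A∩B| / (|A∩B| + α|A\B| + β|B\A|), with each fingerprint represented as a plain Python set of "on" bit positions. The function name `tversky` and the toy fingerprints are illustrative assumptions, not the encodings or software used in the study. In this form, α = β = 1 gives the Tanimoto coefficient and α = β = 0.5 gives the Dice coefficient.

```python
def tversky(a, b, alpha=1.0, beta=1.0):
    """Tversky index between two fingerprints given as sets of on-bits.

    alpha weights features found only in `a`; beta weights features
    found only in `b`. The function name and set representation are
    illustrative assumptions, not taken from the paper.
    """
    a, b = set(a), set(b)
    common = len(a & b)   # features shared by both fingerprints
    only_a = len(a - b)   # features present only in a
    only_b = len(b - a)   # features present only in b
    denom = common + alpha * only_a + beta * only_b
    # Return 0.0 when the denominator vanishes (e.g. two empty fingerprints).
    return common / denom if denom else 0.0


# Toy fingerprints (hypothetical bit positions).
drug = {1, 2, 3, 4}
metabolite = {3, 4, 5}

tanimoto = tversky(drug, metabolite)                 # alpha = beta = 1
drug_side = tversky(drug, metabolite, 1.0, 0.0)      # ignore metabolite-only bits
metab_side = tversky(drug, metabolite, 0.0, 1.0)     # ignore drug-only bits
```

Because α and β weight the two "unshared" parts differently, swapping the arguments generally changes the result unless α = β, which is the asymmetry referred to above.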
Dramatic differences are observed in (i) the drug-endogenite similarity heatmaps, (ii) the cumulative "greatest similarity" curves, and (iii) the fraction of drugs with a Tversky similarity to a metabolite exceeding a given value when the Tversky α and β parameters are varied from their Tanimoto values. The same is true when the sum of the α and β parameters is varied. A clear trend toward increased endogenite-likeness of marketed drugs is observed when α or β adopt values nearer the extremes of their range, and when their sum is smaller. The kinds of molecules exhibiting the greatest similarity to two interrogating drug molecules (chlorpromazine and clozapine) also vary in both nature and the values of their similarity as α and β are varied. The same is true for the converse, when drugs are interrogated with an endogenite. The fraction of drugs with a Tversky similarity to a molecule in a library exceeding a given value depends on the contents of that library, and α and β may be "tuned" accordingly, in a semi-supervised manner. At some values of α and β drug discovery library candidates or natural products can "look" much more like (i.e., have a numerical similarity much closer to) drugs than do even endogenites.
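The "fraction of drugs exceeding a given similarity" statistic described above can be sketched as follows. This is a hedged illustration only: it assumes the standard Tversky form with fingerprints as sets of on-bits, and the helper names (`tversky`, `best_similarity`, `fraction_above`) and the toy data are hypothetical rather than taken from the study.

```python
def tversky(a, b, alpha=1.0, beta=1.0):
    """Standard Tversky index on sets of on-bits (illustrative helper)."""
    a, b = set(a), set(b)
    common = len(a & b)
    denom = common + alpha * len(a - b) + beta * len(b - a)
    return common / denom if denom else 0.0


def best_similarity(query, library, alpha=1.0, beta=1.0):
    """Greatest Tversky similarity of `query` to any molecule in `library`."""
    return max((tversky(query, m, alpha, beta) for m in library), default=0.0)


def fraction_above(drugs, library, threshold, alpha=1.0, beta=1.0):
    """Fraction of drugs whose best similarity to the library exceeds `threshold`."""
    if not drugs:
        return 0.0
    hits = sum(
        1 for d in drugs
        if best_similarity(d, library, alpha, beta) > threshold
    )
    return hits / len(drugs)


# Hypothetical toy fingerprints, not real molecules.
drugs = [{1, 2, 3, 4}, {7, 8}]
library = [{3, 4, 5}, {8, 9, 10}]

frac_tanimoto = fraction_above(drugs, library, 0.3)              # alpha = beta = 1
frac_small_sum = fraction_above(drugs, library, 0.3, 0.1, 0.1)   # smaller alpha + beta
```

In this toy case the fraction rises when α + β is reduced, which is in the same direction as the trend reported above; the numbers themselves carry no chemical meaning. Swapping `library` for a different collection changes the result, which is why α and β would need tuning per library.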
Overall, the Tversky similarity metrics provide a more useful range of examples of molecular similarity than does the simpler Tanimoto similarity, and help to draw attention to molecular similarities that would not be recognized if Tanimoto alone were used. Hence, the Tversky similarity metrics are likely to be of significant value in many general problems in cheminformatics.