Department of Chemistry, University of Florida, Gainesville, FL 32603, USA.
Medicinal Chemistry Research Group, Research Centre for Natural Sciences, Magyar tudósok krt. 2, 1117, Budapest, Hungary.
Mol Inform. 2021 Jul;40(7):e2060017. doi: 10.1002/minf.202060017. Epub 2021 Apr 23.
Similarity measures are widely used in areas ranging from taxonomy to cheminformatics. A large number of similarity and distance measures (collectively, comparative measures) have therefore been introduced, but only a few studies have examined how they relate to one another. We present a thorough analytical study of the conditions under which two comparative measures provide equivalent results over a given set of molecules. A key part of this work is the introduction of a novel way to study the consistency between comparative measures: differential consistency analysis (DCA). This tool shows how consistency can be established analytically with minimal (or no) assumptions. We found that the consistency between the Tanimoto and cosine coefficients improves when the reference is chosen so that its similarity to the rest of the molecules varies less, or when the molecules are represented in a way that does not depend strongly on their size (i.e., bit frequency in the chosen fingerprint representation). The presented derivations are only generic examples; DCA can be applied widely, to all binary similarity coefficients introduced so far, independently of the molecular representation.
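The size dependence mentioned above can be illustrated numerically. The following is a minimal sketch, not the DCA derivation itself: it treats each fingerprint as a set of on-bit indices and counts how often the Tanimoto and cosine coefficients order two molecules the same way relative to a reference. The function names, the 256-bit length, and the random fingerprints are assumptions made for illustration only. When every fingerprint has the same number of on bits n, both coefficients are strictly increasing functions of the number of shared bits c (T = c / (2n - c), cosine = c / n), so the two rankings must coincide.

```python
import math
import random
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto coefficient for two fingerprints given as sets of on-bit indices."""
    common = len(a & b)
    union = len(a) + len(b) - common
    return common / union if union else 0.0


def cosine(a, b):
    """Cosine coefficient for two fingerprints given as sets of on-bit indices."""
    common = len(a & b)
    denom = math.sqrt(len(a) * len(b))
    return common / denom if denom else 0.0


def sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def consistency_fraction(reference, molecules):
    """Fraction of molecule pairs that both coefficients order identically
    with respect to the reference (hypothetical helper, not from the paper)."""
    t = [tanimoto(reference, m) for m in molecules]
    c = [cosine(reference, m) for m in molecules]
    pairs = list(combinations(range(len(molecules)), 2))
    agree = sum(1 for i, j in pairs if sign(t[i] - t[j]) == sign(c[i] - c[j]))
    return agree / len(pairs)


rng = random.Random(0)   # fixed seed so the run is reproducible
N_BITS = 256             # assumed fingerprint length

# Case 1: every fingerprint (reference included) has exactly 40 on bits.
fixed_ref = frozenset(rng.sample(range(N_BITS), 40))
fixed_mols = [frozenset(rng.sample(range(N_BITS), 40)) for _ in range(50)]

# Case 2: the number of on bits varies widely from molecule to molecule.
var_ref = frozenset(rng.sample(range(N_BITS), 40))
var_mols = [frozenset(rng.sample(range(N_BITS), rng.randint(5, 120)))
            for _ in range(50)]

fixed_consistency = consistency_fraction(fixed_ref, fixed_mols)
variable_consistency = consistency_fraction(var_ref, var_mols)

print("equal-size fingerprints:   ", fixed_consistency)
print("variable-size fingerprints:", variable_consistency)
```

In the equal-size case the fraction is exactly 1. In the variable-size case the two coefficients are free to disagree on some pairs, which is the kind of size dependence the abstract refers to; how large the disagreement is depends on the data.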