Ties in proximity and clustering compounds.

Author information

MacCuish J, Nicolaou C, MacCuish N E

Affiliation

Bioreason, Inc., Santa Fe, New Mexico 87501, USA.

Publication information

J Chem Inf Comput Sci. 2001 Jan-Feb;41(1):134-46. doi: 10.1021/ci000069q.

Abstract

Hierarchical clustering algorithms such as Ward's or complete-link are commonly used in compound selection and diversity analysis. Many such applications utilize binary representations of chemical structures, such as MACCS keys or Daylight fingerprints, and dissimilarity measures, such as the Euclidean or the Soergel measure. However, hierarchical clustering algorithms can generate ambiguous results owing to what is known in the cluster analysis literature as the ties in proximity problem, i.e., compounds or clusters of compounds that are equidistant from a compound or cluster in a given collection. Ambiguous ties can occur when clustering only a few hundred compounds, and the larger the number of compounds to be clustered, the greater the chance for significant ambiguity. Namely, as the number of "ties in proximity" increases relative to the total number of proximities, the possibility of ambiguity also increases. To ensure that there are no ambiguous ties, we show by a probabilistic argument that the number of compounds needs to be less than $2n^{1/4}$, where n is the total number of proximities, provided that the measure used to generate the proximities yields a uniform distribution without statistically preferred values. The common measures do not produce uniformly distributed proximities, but rather statistically preferred values that tend to increase the number of ties in proximity. Hence, the number of possible proximities and the distribution of statistically preferred values of a similarity measure, given a bit vector representation of a specific length, are directly related to the number of ties in proximity for a given data set. We explore the ties in proximity problem, using a number of chemical collections with varying degrees of diversity, given several common similarity measures and clustering algorithms. Our results are consistent with our probabilistic argument and show that this problem is significant for relatively small compound sets.
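To make the ties in proximity problem concrete, the following Python sketch counts how many pairwise Soergel distances coincide in a small set of binary fingerprints. It is an illustration only, not the authors' code: the names `soergel`, `tie_summary`, and `max_compounds_without_ties` are hypothetical, fingerprints are assumed to be stored as Python integers used as bit vectors, and the last function simply evaluates the $2n^{1/4}$ bound quoted in the abstract, taking n as a caller-supplied argument.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations
import random


def soergel(a: int, b: int) -> Fraction:
    """Soergel distance (1 - Tanimoto) between two bit vectors stored as ints.

    Exact fractions are used so that equal distances compare equal,
    with no floating-point noise hiding or creating ties.
    """
    union = bin(a | b).count("1")
    if union == 0:
        return Fraction(0)  # two empty fingerprints: treat as identical
    intersection = bin(a & b).count("1")
    return 1 - Fraction(intersection, union)


def tie_summary(fingerprints):
    """Return (pairs, distinct distance values, pairs sharing a value)."""
    distances = [soergel(a, b) for a, b in combinations(fingerprints, 2)]
    counts = Counter(distances)
    tied = sum(c for c in counts.values() if c > 1)
    return len(distances), len(counts), tied


def max_compounds_without_ties(n: float) -> float:
    """Evaluate the abstract's bound 2 * n**(1/4).

    Assumption: n is whatever the abstract calls "the total number of
    proximities"; this helper does not derive or reinterpret it.
    """
    return 2 * n ** 0.25


if __name__ == "__main__":
    # Hypothetical data: 200 random 166-bit vectors (MACCS-like length).
    rng = random.Random(0)
    fps = [rng.getrandbits(166) for _ in range(200)]
    pairs, distinct, tied = tie_summary(fps)
    print(f"{pairs} pairs, {distinct} distinct distances, {tied} tied pairs")
```

Even four 4-bit fingerprints such as `0b1100`, `0b1010`, `0b0110`, and `0b0011` give six pairwise distances but only two distinct values (2/3 and 1), so a hierarchical method has to break ties arbitrarily at its very first merge.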

