Ties in proximity and clustering compounds.

Authors

MacCuish J, Nicolaou C, MacCuish N E

Affiliation

Bioreason, Inc., Santa Fe, New Mexico 87501, USA.

Publication details

J Chem Inf Comput Sci. 2001 Jan-Feb;41(1):134-46. doi: 10.1021/ci000069q.

DOI:10.1021/ci000069q
PMID:11206366
Abstract

Hierarchical clustering algorithms such as Ward's or complete-link are commonly used in compound selection and diversity analysis. Many such applications utilize binary representations of chemical structures, such as MACCS keys or Daylight fingerprints, and dissimilarity measures, such as the Euclidean or the Soergel measure. However, hierarchical clustering algorithms can generate ambiguous results owing to what is known in the cluster analysis literature as the ties in proximity problem, i.e., compounds or clusters of compounds that are equidistant from a compound or cluster in a given collection. Ambiguous ties can occur when clustering only a few hundred compounds, and the larger the number of compounds to be clustered, the greater the chance for significant ambiguity. Namely, as the number of "ties in proximity" increases relative to the total number of proximities, the possibility of ambiguity also increases. To ensure that there are no ambiguous ties, we show by a probabilistic argument that the number of compounds needs to be less than $2n^{1/4}$, where n is the total number of proximities, and the measure used to generate the proximities creates a uniform distribution without statistically preferred values. The common measures do not produce uniformly distributed proximities, but rather statistically preferred values that tend to increase the number of ties in proximity. Hence, the number of possible proximities and the distribution of statistically preferred values of a similarity measure, given a bit vector representation of a specific length, are directly related to the number of ties in proximities for a given data set. We explore the ties in proximity problem, using a number of chemical collections with varying degrees of diversity, given several common similarity measures and clustering algorithms. Our results are consistent with our probabilistic argument and show that this problem is significant for relatively small compound sets.
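
The effect described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code. It assumes fingerprints are stored as Python integers, uses randomly generated 166-bit vectors as a stand-in for real MACCS keys, and counts pairs that share an identical Soergel value as a simplified proxy for ties in proximity (the paper is concerned with ties that actually make a clustering ambiguous, which is a narrower condition). The function names `soergel`, `tie_summary`, and `max_compounds_without_ties` are hypothetical.

```python
import random
from collections import Counter
from fractions import Fraction
from itertools import combinations


def soergel(a: int, b: int) -> Fraction:
    """Soergel dissimilarity (1 - Tanimoto) of two bit vectors stored as ints."""
    union = bin(a | b).count("1")
    if union == 0:
        return Fraction(0)  # two empty fingerprints are treated as identical
    common = bin(a & b).count("1")
    # Exact rational arithmetic avoids spurious ties or non-ties from rounding.
    return 1 - Fraction(common, union)


def tie_summary(fingerprints):
    """Return (pair count, distinct proximity values, pairs sharing a value)."""
    counts = Counter(soergel(a, b) for a, b in combinations(fingerprints, 2))
    n_pairs = sum(counts.values())
    tied_pairs = sum(c for c in counts.values() if c > 1)
    return n_pairs, len(counts), tied_pairs


def max_compounds_without_ties(n_proximities: int) -> float:
    """Bound quoted in the abstract: 2 * n**(1/4), for uniformly distributed proximities."""
    return 2 * n_proximities ** 0.25


if __name__ == "__main__":
    random.seed(0)
    # Assumption: random bits stand in for real structural keys.
    fps = [random.getrandbits(166) for _ in range(200)]
    pairs, distinct, tied = tie_summary(fps)
    print(f"{pairs} pairs, {distinct} distinct values, {tied} pairs in ties")
    print(f"bound for n = 10,000 proximities: {max_compounds_without_ties(10_000):.1f}")
```

Because a Soergel value on fixed-length bit vectors is a ratio of two small integers, the set of attainable values is limited, so a few hundred compounds already produce far more pairs than distinct proximities. Substituting real fingerprints for the random ones would show how statistically preferred values change the tie count.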

Similar articles

1
Ties in proximity and clustering compounds.
J Chem Inf Comput Sci. 2001 Jan-Feb;41(1):134-46. doi: 10.1021/ci000069q.
2
How frequently do clusters occur in hierarchical clustering analysis? A graph theoretical approach to studying ties in proximity.
J Cheminform. 2016 Jan 25;8:4. doi: 10.1186/s13321-016-0114-x. eCollection 2016.
3
Selecting diversified compounds to build a tangible library for biological and biochemical assays.
Molecules. 2010 Jul 23;15(7):5031-44. doi: 10.3390/molecules15075031.
4
Novel symmetry-based gene-gene dissimilarity measures utilizing Gene Ontology: Application in gene clustering.
Gene. 2018 Dec 30;679:341-351. doi: 10.1016/j.gene.2018.08.062. Epub 2018 Sep 2.
5
Indefinite Proximity Learning: A Review.
Neural Comput. 2015 Oct;27(10):2039-96. doi: 10.1162/NECO_a_00770. Epub 2015 Aug 27.
6
Comparison of similarity coefficients for clustering and compound selection.
J Chem Inf Model. 2008 Mar;48(3):498-508. doi: 10.1021/ci700413a. Epub 2008 Feb 23.
7
GO functional similarity clustering depends on similarity measure, clustering method, and annotation completeness.
BMC Bioinformatics. 2019 Mar 27;20(1):155. doi: 10.1186/s12859-019-2752-2.
8
Cluster-based network proximities for arbitrary nodal subsets.
Sci Rep. 2018 Sep 25;8(1):14371. doi: 10.1038/s41598-018-32172-0.
9
Geometry- and Accuracy-Preserving Random Forest Proximities.
IEEE Trans Pattern Anal Mach Intell. 2023 Sep;45(9):10947-10959. doi: 10.1109/TPAMI.2023.3263774. Epub 2023 Aug 7.
10
LEGClust: a clustering algorithm based on layered entropic subgraphs.
IEEE Trans Pattern Anal Mach Intell. 2008 Jan;30(1):62-75. doi: 10.1109/TPAMI.2007.1142.

Cited by

1
Nonunique UPGMA clusterings of microsatellite markers.
Brief Bioinform. 2022 Sep 20;23(5). doi: 10.1093/bib/bbac312.
2
QSAR without borders.
Chem Soc Rev. 2020 Jun 7;49(11):3525-3564. doi: 10.1039/d0cs00098a. Epub 2020 May 1.
3
How frequently do clusters occur in hierarchical clustering analysis? A graph theoretical approach to studying ties in proximity.
J Cheminform. 2016 Jan 25;8:4. doi: 10.1186/s13321-016-0114-x. eCollection 2016.
4
JEDA: Joint entropy diversity analysis. An information-theoretic method for choosing diverse and representative subsets from combinatorial libraries.
Mol Divers. 2006 Aug;10(3):333-9. doi: 10.1007/s11030-006-9042-4. Epub 2006 Sep 21.
5
Global analysis of large-scale chemical and biological experiments.
Curr Opin Drug Discov Devel. 2002 May;5(3):355-60.