References cited in this article

1. A cluster separation measure. IEEE Trans Pattern Anal Mach Intell. 1979 Feb;1(2):224-7.

2. A generalization of Poisson's binomial limit for use in ecology. Biometrika. 1949 Jun;36(Pt. 1-2):18-25.

3. A cluster-based strategy for assessing the overlap between large chemical libraries and its application to a recent acquisition. J Chem Inf Model. 2006 Nov-Dec;46(6):2651-60. doi: 10.1021/ci600219n.

4. A fast clustering algorithm for analyzing highly similar compounds of very large libraries. J Chem Inf Model. 2006 Sep-Oct;46(5):1919-23. doi: 10.1021/ci0600859.

5. R-NN curves: an intuitive approach to outlier detection using a distance based method. J Chem Inf Model. 2006 Jul-Aug;46(4):1713-22. doi: 10.1021/ci060013h.

6. Visualization of large-scale aqueous solubility data using a novel hierarchical data visualization technique. J Chem Inf Model. 2006 May-Jun;46(3):1054-9. doi: 10.1021/ci0504770.

7. Robust ligand-based modeling of the biological targets of known drugs. J Med Chem. 2006 May 18;49(10):2921-38. doi: 10.1021/jm051139t.

8. Are clusters found in one dataset present in another dataset? Biostatistics. 2007 Jan;8(1):9-31. doi: 10.1093/biostatistics/kxj029. Epub 2006 Apr 12.

9. A comparative study on the application of hierarchical-agglomerative clustering approaches to organize outputs of reiterated docking runs. J Chem Inf Model. 2006 Mar-Apr;46(2):852-62. doi: 10.1021/ci050141q.

10. A novel search engine for virtual screening of very large databases. J Chem Inf Model. 2006 Mar-Apr;46(2):836-43. doi: 10.1021/ci050458q.
Counting clusters using R-NN curves.

Authors

Guha Rajarshi, Dutta Debojyoti, Wild David J, Chen Ting

Affiliation

School of Informatics, Indiana University, Bloomington, Indiana 47406, USA.

Publication information

J Chem Inf Model. 2007 Jul-Aug;47(4):1308-18. doi: 10.1021/ci600541f. Epub 2007 Jun 30.

DOI: 10.1021/ci600541f

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC2543137/

Abstract

Clustering is a common task in the field of cheminformatics. A key parameter that needs to be set for nonhierarchical clustering methods, such as k-means, is the number of clusters, k. Traditionally, the value of k is obtained by performing the clustering with different values of k and selecting the value that leads to the optimal clustering. In this study, we describe an approach to selecting k, a priori, based on the R-NN curve algorithm described by Guha et al. (J. Chem. Inf. Model., 2006, 46, 1713-1722), which uses a nearest-neighbor technique to characterize the spatial location of compounds in arbitrary descriptor spaces. The algorithm generates a set of curves for the data set, which are then analyzed to estimate the natural number of clusters. We then performed k-means clustering with the predicted value of k as well as with similar values to check that the correct number of clusters was obtained. In addition, we compared the predicted value to the number indicated by the average silhouette width as a cluster quality measure. We tested the algorithm on simulated data as well as on two chemical data sets. Our results indicate that the R-NN curve algorithm is able to determine the natural number of clusters and is in general agreement with the average silhouette width in identifying the optimal number of clusters.
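The two quantities named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that an R-NN curve for a point is the number of neighbors lying within a radius R as R is swept up to the maximum pairwise distance in the data set, which is our reading of the cited 2006 paper; the curve-analysis step that turns the curves into an estimate of k is not described in the abstract and is not reproduced here. The average silhouette width follows the standard definition s(i) = (b - a) / max(a, b). The function names and the `steps` parameter are hypothetical.

```python
import math


def pairwise_distances(points):
    """Return the full Euclidean distance matrix as a list of lists."""
    n = len(points)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(points[i], points[j])
            dist[i][j] = dist[j][i] = d
    return dist


def rnn_curves(points, steps=100):
    """For each point, count the neighbors within radius R.

    R is swept in `steps` equal increments up to the maximum pairwise
    distance (an assumed sweep, not taken from the abstract).
    """
    dist = pairwise_distances(points)
    d_max = max(max(row) for row in dist)
    curves = []
    for i, row in enumerate(dist):
        others = [d for j, d in enumerate(row) if j != i]
        curve = []
        for s in range(1, steps + 1):
            radius = d_max * (s / steps)
            curve.append(sum(1 for d in others if d <= radius))
        curves.append(curve)
    return curves


def average_silhouette_width(points, labels):
    """Mean of s(i) = (b - a) / max(a, b); needs at least two clusters."""
    dist = pairwise_distances(points)
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    total = 0.0
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            continue  # singleton cluster: s(i) is taken as 0
        # a: mean distance to the other members of the point's own cluster
        a = sum(dist[i][j] for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)
```

With a k-means implementation of your choice, the traditional procedure described in the abstract amounts to computing `average_silhouette_width` for the labels obtained at each candidate k and keeping the k with the largest value. The R-NN approach instead works from the curves before any clustering is run.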
