Guha Rajarshi, Dutta Debojyoti, Wild David J, Chen Ting
School of Informatics, Indiana University, Bloomington, Indiana 47406, USA.
J Chem Inf Model. 2007 Jul-Aug;47(4):1308-18. doi: 10.1021/ci600541f. Epub 2007 Jun 30.
Clustering is a common task in the field of cheminformatics. A key parameter that needs to be set for nonhierarchical clustering methods, such as k-means, is the number of clusters, k. Traditionally, the value of k is obtained by performing the clustering with different values of k and selecting the value that leads to the optimal clustering. In this study, we describe an approach to selecting k, a priori, based on the R-NN curve algorithm described by Guha et al. (J. Chem. Inf. Model., 2006, 46, 1713-722), which uses a nearest-neighbor technique to characterize the spatial location of compounds in arbitrary descriptor spaces. The algorithm generates a set of curves for the data set, which are then analyzed to estimate the natural number of clusters. We then performed k-means clustering with the predicted value of k, as well as with similar values, to check that the correct number of clusters was obtained. In addition, we compared the predicted value to the number indicated by the average silhouette width as a cluster quality measure. We tested the algorithm on simulated data as well as on two chemical data sets. Our results indicate that the R-NN curve algorithm is able to determine the natural number of clusters and is in general agreement with the average silhouette width in identifying the optimal number of clusters.
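The traditional procedure mentioned in the abstract (cluster with several values of k, then keep the value that gives the best clustering) can be illustrated with a minimal sketch that scores each k by the average silhouette width. This is not the R-NN curve algorithm and not the authors' code. It is a pure-Python illustration on simulated 2D data, and the function names (`simulate_blobs`, `kmeans`, `average_silhouette_width`, `select_k_by_silhouette`), the farthest-point initialisation, and the blob parameters are all assumptions made for the example.

```python
import math
import random


def simulate_blobs(centres, n_per_blob, rng, spread=1.0):
    """Draw n_per_blob Gaussian points around each 2D centre (hypothetical test data)."""
    return [(rng.gauss(cx, spread), rng.gauss(cy, spread))
            for cx, cy in centres for _ in range(n_per_blob)]


def kmeans(points, k, rng, n_iter=50):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    # Farthest-point initialisation (an assumption of this sketch): start from a
    # random point, then repeatedly add the point farthest from the chosen centres.
    centres = [rng.choice(points)]
    while len(centres) < k:
        centres.append(max(points, key=lambda p: min(math.dist(p, c) for c in centres)))
    labels = []
    for _ in range(n_iter):
        new_labels = [min(range(k), key=lambda j: math.dist(p, centres[j])) for p in points]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centre if a cluster becomes empty
                centres[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def average_silhouette_width(points, labels):
    """Mean of s(i) = (b - a) / max(a, b) over all points; requires >= 2 clusters."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    widths = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            widths.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        widths.append((b - a) / max(a, b))
    return sum(widths) / len(widths)


def select_k_by_silhouette(points, candidate_ks, rng):
    """Cluster once per candidate k and return (best k, {k: average silhouette width})."""
    scores = {k: average_silhouette_width(points, kmeans(points, k, rng))
              for k in candidate_ks}
    return max(scores, key=scores.get), scores


rng = random.Random(0)
points = simulate_blobs([(0.0, 0.0), (10.0, 0.0), (5.0, 9.0)], 30, rng)
best_k, scores = select_k_by_silhouette(points, range(2, 7), rng)
print("k with the largest average silhouette width:", best_k)
```

On three well-separated blobs like these, the average silhouette width should peak at k = 3. The cost of this approach is one full clustering run per candidate k, which is the overhead an a priori estimate of k aims to avoid.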