Xu Shutan, Zou Shuxue, Wang Lincong
College of Computer Science and Technology, Jilin University, Changchun, P.R. China.
J Comput Biol. 2015 May;22(5):436-50. doi: 10.1089/cmb.2014.0162. Epub 2014 Dec 17.
An important feature of structural data, especially those from structural determination and protein-ligand docking programs, is that their distribution could be mostly uniform. Traditional clustering algorithms developed specifically for nonuniformly distributed data may not be adequate for their classification. Here we present a geometric partitional algorithm that could be applied to both uniformly and nonuniformly distributed data. The algorithm is a top-down approach that recursively selects the outliers as the seeds to form new clusters until all the structures within a cluster satisfy a classification criterion. The algorithm has been evaluated on a diverse set of real structural data and six sets of test data. The results show that it is superior to the previous algorithms for the clustering of structural data and is similar to or better than them for the classification of the test data. The algorithm should be especially useful for the identification of the best but minor clusters and for speeding up an iterative process widely used in NMR structure determination.
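The top-down idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract does not specify how outliers are chosen or what the classification criterion is. The sketch assumes that the two mutually farthest members of a cluster serve as the outlier seeds, that the criterion is a maximum pairwise distance (the hypothetical parameter `max_diameter`), and that structures are represented as plain coordinate tuples compared by Euclidean distance rather than by a structural measure such as RMSD. The function names `farthest_pair` and `partition_by_outliers` are likewise hypothetical.

```python
import math


def farthest_pair(points, idx):
    """Return (distance, i, j) for the two mutually farthest members of idx."""
    best = (0.0, idx[0], idx[0])
    for pos, i in enumerate(idx):
        for j in idx[pos + 1:]:
            d = math.dist(points[i], points[j])
            if d > best[0]:
                best = (d, i, j)
    return best


def partition_by_outliers(points, max_diameter):
    """Split clusters top-down until every cluster's diameter is <= max_diameter.

    Assumed criterion (not from the paper): a cluster is accepted once its
    largest pairwise distance does not exceed max_diameter (must be >= 0).
    """
    if not points:
        return []
    clusters = []
    pending = [list(range(len(points)))]  # start with one cluster holding everything
    while pending:
        idx = pending.pop()
        diameter, a, b = farthest_pair(points, idx)
        if diameter <= max_diameter:
            clusters.append(sorted(idx))  # criterion satisfied: accept the cluster
            continue
        # The two farthest members act as outlier seeds; every member joins
        # the nearer seed, so each seed ends up in its own non-empty subset.
        near_a, near_b = [], []
        for i in idx:
            if math.dist(points[i], points[a]) <= math.dist(points[i], points[b]):
                near_a.append(i)
            else:
                near_b.append(i)
        pending.append(near_a)
        pending.append(near_b)
    return sorted(clusters)


if __name__ == "__main__":
    toy = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (20, 0)]
    print(partition_by_outliers(toy, max_diameter=2.0))
```

Because splitting is driven by the extreme members rather than by density, an isolated point such as `(20, 0)` in the toy data ends up in its own small cluster instead of being absorbed into a larger one, which is the kind of behavior the abstract associates with finding minor clusters.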