Bahcesehir University, Istanbul.
IEEE/ACM Trans Comput Biol Bioinform. 2012;9(2):408-20. doi: 10.1109/TCBB.2011.129. Epub 2011 Sep 27.
Clustering has a long and rich history in a variety of scientific fields. Finding the natural groupings in a data set is a hard task, as attested by the hundreds of clustering algorithms in the literature. Each clustering technique makes some assumptions about the underlying data set. If the assumptions hold, good clusterings can be expected. It is hard, and in some cases impossible, to satisfy all the assumptions. Therefore, it is beneficial to apply different clustering methods to the same data set, the same method with varying input parameters, or both. We propose a novel method, DICLENS, which combines a set of clusterings into a final clustering of better overall quality. Our method produces the final clustering automatically and does not take any input parameters, a feature missing in many existing algorithms. Extensive experimental studies on real, artificial, and gene expression data sets demonstrate that DICLENS produces very good quality clusterings in a short amount of time. The DICLENS implementation is scalable and consumes very little memory and CPU, so it runs on standard personal computers.
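To make the idea of combining several clusterings concrete, the following is a minimal Python sketch of a generic co-association (evidence accumulation) consensus. It is not the DICLENS algorithm, whose internals the abstract does not describe. The function names `co_association` and `consensus` and the `threshold` parameter are assumptions introduced here for illustration only; note that DICLENS, by contrast, is stated to need no input parameters.

```python
from itertools import combinations


def co_association(labelings):
    """Return an n x n matrix whose (i, j) entry is the fraction of
    input clusterings that place points i and j in the same cluster."""
    n = len(labelings[0])
    m = len(labelings)
    counts = [[0] * n for _ in range(n)]
    for labels in labelings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / m for c in row] for row in counts]


def consensus(labelings, threshold=0.5):
    """Illustrative consensus: link two points when they share a cluster
    in more than `threshold` of the input clusterings, then return the
    connected components as the final cluster labels.
    `threshold` is a hypothetical parameter, not part of DICLENS."""
    agree = co_association(labelings)
    n = len(agree)
    parent = list(range(n))  # union-find forest over the points

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in combinations(range(n), 2):
        if agree[i][j] > threshold:
            parent[find(i)] = find(j)

    # Relabel component roots as 0, 1, 2, ... in order of first appearance.
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]
```

Given three clusterings of six points, for example `[[0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 2, 2], [1, 1, 1, 0, 0, 0]]`, `consensus` groups the first three points together and the last three together, since each of those pairs agrees in at least two of the three inputs. Cluster label values need not match across inputs; only co-membership matters.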