Efficient layered density-based clustering of categorical data.

Author information

Andreopoulos Bill, An Aijun, Wang Xiaogang, Labudde Dirk

Affiliation

Biotechnological Centre, Technische Universität Dresden, 47-51 Tatzberg, 01307 Dresden Sachsen, Germany.

Publication information

J Biomed Inform. 2009 Apr;42(2):365-76. doi: 10.1016/j.jbi.2008.11.004. Epub 2008 Dec 10.

DOI:10.1016/j.jbi.2008.11.004

PMID:19111944

Abstract

A challenge in applying density-based clustering to categorical biomedical data is that the "cube" of attribute values has no defined ordering, which makes the search for dense subspaces slow. We propose the HIERDENC algorithm for hierarchical density-based clustering of categorical data, together with a complementary index for searching for dense subspaces efficiently. The HIERDENC index is updated when new objects are introduced, so clustering does not need to be repeated on all objects. Both updating and cluster retrieval are efficient. In comparisons with several other clustering algorithms on large datasets, HIERDENC achieved better runtime scalability in the number of objects, as well as better cluster quality. By rapidly collapsing the bicliques in large networks, we achieved an edge reduction of as much as 86.5%. HIERDENC is suitable for large and quickly growing datasets, since it is independent of object ordering, does not require re-clustering when new data emerge, and requires no user-specified input parameters.

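The abstract's idea of an index that supports dense-subspace search and absorbs new objects without re-clustering can be illustrated with a small sketch. This is not the HIERDENC index itself, whose structure the abstract does not describe. It assumes that density around a point of the categorical "cube" is measured as the number of objects within a Hamming-distance radius, and it uses a hypothetical inverted index (`ValueIndex`, with `add` and `neighbors`) keyed by (attribute position, value). All names here are illustrative.

```python
from collections import defaultdict


class ValueIndex:
    """Hypothetical inverted index: (attribute position, value) -> object ids.

    Illustrative only; not the index structure used by HIERDENC.
    """

    def __init__(self):
        self.objects = []                 # object id -> attribute tuple
        self.postings = defaultdict(set)  # (position, value) -> set of ids

    def add(self, obj):
        """Insert one object; only the postings for its own values change."""
        oid = len(self.objects)
        self.objects.append(tuple(obj))
        for pos, value in enumerate(obj):
            self.postings[(pos, value)].add(oid)
        return oid

    def neighbors(self, center, radius):
        """Return sorted ids of objects within Hamming distance `radius`."""
        m = len(center)
        if radius >= m:
            # Every object is within the radius, including those that
            # share no value with the center.
            return list(range(len(self.objects)))
        # Count, per object, how many attribute values match the center.
        matches = defaultdict(int)
        for pos, value in enumerate(center):
            for oid in self.postings.get((pos, value), ()):
                matches[oid] += 1
        # Hamming distance = number of attributes - number of matches.
        return sorted(oid for oid, k in matches.items() if m - k <= radius)


index = ValueIndex()
for row in [("A", "x", "1"), ("A", "x", "2"), ("B", "y", "3")]:
    index.add(row)

before = index.neighbors(("A", "x", "1"), radius=1)

# A new object arrives: the index is extended in place, nothing is rebuilt.
index.add(("A", "z", "1"))
after = index.neighbors(("A", "x", "1"), radius=1)
```

Because attribute values are unordered, the sketch never sorts or ranges over values; it only looks up exact matches per attribute, and insertion touches one posting set per attribute of the new object.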