Novoselova Natalia, Wang Junxi, Klawonn Frank
Department of Bioinformatics, United Institute of Informatics Problems, Surganova Str. 6, Minsk 220012, Belarus.
Biostatistics, Helmholtz Centre for Infection Research, Inhoffenstraße 7, 38124, Braunschweig, Germany.
J Bioinform Comput Biol. 2015 Aug;13(4):1550012. doi: 10.1142/S0219720015500122. Epub 2015 Mar 2.
Hierarchical clustering is extensively used in the bioinformatics community to analyze biomedical data. These data are often tagged with class labels, such as disease subtypes or gene ontology (GO) terms. Heatmaps combined with dendrograms are the common standard for visualizing the results of hierarchical clustering. The heatmap can be enriched by an additional color bar at the side that indicates, for each instance in the data set, the class to which it belongs. In the ideal case, when the clustering matches the classes perfectly, one would expect instances from the same class to cluster together and the color bar to consist of well-separated color blocks without frequent alternation of colors (classes). But even when instances from the same class cluster perfectly together, the dendrogram might not reflect this important aspect, because its representation is not unique: the two branches at any internal node can be swapped without changing the clustering. In this paper, we propose a leaf ordering algorithm for the dendrogram that, while preserving the hierarchical clustering result, tries to group instances from the same class together. It is based on dynamic programming, which can efficiently compute an optimal or nearly optimal order consistent with the structure of the tree.
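The idea of a class-aware leaf ordering can be illustrated with a minimal sketch. This is not the authors' algorithm; it is an assumed simplification in which the dendrogram is a binary tree given as nested tuples, and the objective is taken to be the number of class changes between neighboring leaves (the number of breaks in the color bar). The function name `order_leaves` and the input layout are hypothetical. For each subtree, the dynamic program keeps the cheapest leaf order for every pair of first and last class labels, and combines the two children in both orientations.

```python
def order_leaves(tree, labels):
    """Return (leaf_order, n_changes) with the fewest class changes.

    tree:   a leaf id, or a 2-tuple (left_subtree, right_subtree).
            Assumption: leaf ids are not tuples themselves.
    labels: dict mapping leaf id -> class label.
    """
    def solve(node):
        # Table: (first_label, last_label) -> (cost, leaf_order)
        if not isinstance(node, tuple):
            c = labels[node]
            return {(c, c): (0, [node])}
        left, right = solve(node[0]), solve(node[1])
        table = {}
        # Try both orientations of the two children (flip or no flip).
        for a, b in ((left, right), (right, left)):
            for (a1, a2), (cost_a, order_a) in a.items():
                for (b1, b2), (cost_b, order_b) in b.items():
                    # One extra change if the labels differ at the seam.
                    cost = cost_a + cost_b + (a2 != b1)
                    key = (a1, b2)
                    if key not in table or cost < table[key][0]:
                        table[key] = (cost, order_a + order_b)
        return table

    cost, order = min(solve(tree).values(), key=lambda entry: entry[0])
    return order, cost


# Hypothetical toy dendrogram: five leaves, two classes.
tree = ((0, 1), (2, (3, 4)))
labels = {0: "A", 1: "B", 2: "A", 3: "B", 4: "A"}
order, changes = order_leaves(tree, labels)
```

In this toy case the plotted order 0, 1, 2, 3, 4 alternates classes at every step, while flipping branches reduces the number of changes without altering the tree. The sketch stores full leaf orders in each table entry and recurses on the tree, which keeps it short but would need reworking for large or very deep dendrograms.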