Torrente Aurora, Kapushesky Misha, Brazma Alvis
EMBL Outstation-Hinxton, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK.
Bioinformatics. 2005 Nov 1;21(21):3993-9. doi: 10.1093/bioinformatics/bti644. Epub 2005 Sep 1.
Clustering is one of the most widely used methods in unsupervised gene expression data analysis. The use of different clustering algorithms or different parameters often produces rather different results on the same data. Biological interpretation of multiple clustering results requires understanding how different clusters relate to each other. It is particularly non-trivial to compare the results of a hierarchical and a flat, e.g. k-means, clustering.
We present a new method for comparing and visualizing relationships between different clustering results, either flat versus flat, or flat versus hierarchical. When comparing a flat clustering to a hierarchical clustering, the algorithm cuts different branches in the hierarchical tree at different levels to optimize the correspondence between the clusters. The optimization function is based on graph layout aesthetics or on mutual information. The clusters are displayed using a bipartite graph where the edges are weighted proportionally to the number of common elements in the respective clusters and the weighted number of crossings is minimized. The performance of the algorithm is tested using simulated and real gene expression data. The algorithm is implemented in the online gene expression data analysis tool Expression Profiler.
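The quantities named in the abstract can be illustrated for the simpler flat-versus-flat case. The following is a minimal sketch, not the implementation used in Expression Profiler: the function names are hypothetical, the crossing weight is assumed to be the product of the two edge weights (the abstract does not give the exact formula), and the exhaustive search over orderings is a stand-in that is only feasible for a handful of clusters. Cutting the hierarchical tree at different levels is not covered here.

```python
from collections import Counter
from itertools import combinations, permutations
from math import log


def contingency(a, b):
    """Edge weights of the bipartite graph: number of elements shared by
    each (cluster in a, cluster in b) pair. a and b are label sequences
    of equal length, one label per element."""
    return Counter(zip(a, b))


def mutual_information(a, b):
    """Mutual information (in nats) between two flat clusterings."""
    n = len(a)
    joint = contingency(a, b)
    size_a, size_b = Counter(a), Counter(b)
    return sum(
        (c / n) * log(c * n / (size_a[i] * size_b[j]))
        for (i, j), c in joint.items()
    )


def weighted_crossings(weights, left_order, right_order):
    """Sum of w1 * w2 over every pair of edges that cross when clusters
    are drawn in the given top-to-bottom orders.
    ASSUMPTION: a crossing is weighted by the product of its edge weights."""
    left_pos = {c: k for k, c in enumerate(left_order)}
    right_pos = {c: k for k, c in enumerate(right_order)}
    total = 0
    for ((i1, j1), w1), ((i2, j2), w2) in combinations(weights.items(), 2):
        # Two edges cross when their endpoints appear in opposite order
        # on the two sides of the bipartite graph.
        if (left_pos[i1] - left_pos[i2]) * (right_pos[j1] - right_pos[j2]) < 0:
            total += w1 * w2
    return total


def best_right_order(weights, left_order):
    """Brute-force the right-side ordering with the fewest weighted
    crossings (illustrative only; factorial in the number of clusters)."""
    right_labels = sorted({j for _, j in weights})
    return min(
        permutations(right_labels),
        key=lambda order: weighted_crossings(weights, left_order, order),
    )
```

For example, with `a = [0, 0, 0, 1, 1, 1]` and `b = ["x", "x", "y", "y", "y", "y"]`, `contingency(a, b)` gives three edges, and `best_right_order` places `"x"` above `"y"` so that no edges cross when the left side is ordered `[0, 1]`.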