Departament d'Enginyeria Informàtica i Matemàtiques, Universitat Rovira i Virgili, Av. Països Catalans 26, 43007, Tarragona, Spain.
Departament d'Enginyeria Química, Universitat Rovira i Virgili, Av. Països Catalans 26, 43007, Tarragona, Spain.
Brief Bioinform. 2022 Sep 20;23(5). doi: 10.1093/bib/bbac312.
Agglomerative hierarchical clustering has become a common tool for the analysis and visualization of data. It is present in a large body of scientific research and pervades all areas of bioinformatics and computational biology. In this work, we focus on a critical problem: the nonuniqueness of the clustering when there are tied distances, for which several solutions exist but are not implemented in most hierarchical clustering packages. We analyze the magnitude of this problem in one particular setting: the clustering of microsatellite markers using the Unweighted Pair-Group Method with Arithmetic Mean (UPGMA). To do so, we calculated the fraction of publications in the Scopus database in which more than one hierarchical clustering is possible, showing that about 46% of the articles are affected. Additionally, to show the problem from a practical point of view, we selected two contrasting examples of articles that have multiple solutions: one with two possible dendrograms, and the other with more than 2.5 million different possible hierarchical clusterings.
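To make the tie problem concrete, the following minimal sketch enumerates every dendrogram that UPGMA can produce when tied minimum distances are broken in each possible way. It is an illustrative brute-force enumeration, not the procedure used in the article. The function name `upgma_all`, the tolerance `tol`, and the two toy distance matrices are assumptions introduced here.

```python
from itertools import combinations


def upgma_all(dist, labels, tol=1e-12):
    """Return the set of distinct UPGMA tree topologies reachable
    by resolving tied minimum distances in every possible way.

    dist   -- symmetric matrix (list of lists) of pairwise distances
    labels -- one hashable label per row of dist
    tol    -- distances within tol of the minimum count as tied (assumption)
    """
    results = set()

    def avg(a, b):
        # UPGMA distance: unweighted mean over all leaf pairs of the two clusters.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    def recurse(clusters, trees):
        if len(clusters) == 1:
            results.add(trees[0])
            return
        pairs = list(combinations(range(len(clusters)), 2))
        d = {p: avg(clusters[p[0]], clusters[p[1]]) for p in pairs}
        dmin = min(d.values())
        # Branch on every pair that attains the minimum distance.
        for i, j in pairs:
            if d[(i, j)] - dmin <= tol:
                rest = [k for k in range(len(clusters)) if k not in (i, j)]
                merged = clusters[i] | clusters[j]
                # A tree is a nested frozenset, so child order does not matter.
                tree = frozenset([trees[i], trees[j]])
                recurse([clusters[k] for k in rest] + [merged],
                        [trees[k] for k in rest] + [tree])

    n = len(labels)
    recurse([frozenset([i]) for i in range(n)], list(labels))
    return results


# Toy data (hypothetical): d(A,B) = d(B,C) = 1 is a tie for the first merge.
tied = [[0, 1, 2],
        [1, 0, 1],
        [2, 1, 0]]
# Same layout without the tie: d(B,C) is raised to 1.2.
untied = [[0, 1, 2],
          [1, 0, 1.2],
          [2, 1.2, 0]]

print(len(upgma_all(tied, "ABC")))    # 2 dendrograms: ((A,B),C) and (A,(B,C))
print(len(upgma_all(untied, "ABC")))  # 1 dendrogram: ((A,B),C)
```

A typical clustering package returns only one of these trees, usually determined by the input order of the samples. The enumeration grows combinatorially with the number of ties, so this sketch is only practical for small inputs.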