Ultsch Alfred, Lötsch Jörn
DataBionics Research Group, University of Marburg, Hans-Meerwein-Straße, 35032 Marburg, Germany.
Institute of Clinical Pharmacology, Goethe-University, Theodor Stern Kai 7, 60590 Frankfurt am Main, Germany; Fraunhofer Institute of Molecular Biology and Applied Ecology-Project Group Translational Medicine and Pharmacology (IME-TMP), Theodor-Stern-Kai 7, 60590 Frankfurt am Main, Germany.
J Biomed Inform. 2017 Feb;66:95-104. doi: 10.1016/j.jbi.2016.12.011. Epub 2016 Dec 28.
High-dimensional biomedical data are frequently clustered to identify subgroup structures pointing at distinct disease subtypes. It is crucial that the clustering algorithm used works correctly. However, by imposing a predefined shape on the clusters, classical algorithms occasionally suggest a cluster structure in homogeneously distributed data or assign data points to incorrect clusters. We analyzed whether this can be avoided by using emergent self-organizing feature maps (ESOM).
Data sets with different degrees of complexity were submitted to ESOM analysis with large numbers of neurons, using an interactive R-based bioinformatics tool. On top of the trained ESOM, the distance structure in the high-dimensional feature space was visualized in the form of a so-called U-matrix. Clustering results were compared with those provided by common classical clustering algorithms, including single linkage, Ward and k-means.
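The idea behind a U-matrix can be illustrated with a minimal sketch. It assumes a planar (non-toroidal) rectangular grid and defines the U-height of a neuron as the mean Euclidean distance between its weight vector and those of its four direct grid neighbours. The function name `u_matrix` and the toy grid are hypothetical and are not taken from the R tool used in the study; no training step is shown.

```python
import math

def u_matrix(weights):
    """Return U-heights for a rectangular grid of neuron weight vectors.

    weights[r][c] is the weight vector (list of floats) of the neuron at grid
    position (r, c). The U-height is taken here as the mean Euclidean distance
    to the direct (4-connected) grid neighbours; this is an assumption of the
    sketch, not the exact definition used by any particular package.
    """
    rows, cols = len(weights), len(weights[0])
    heights = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            dists = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    dists.append(math.dist(weights[r][c], weights[nr][nc]))
            # A 1x1 grid has no neighbours; its height stays 0.0.
            if dists:
                heights[r][c] = sum(dists) / len(dists)
    return heights

# Toy stand-in for a trained map: the left half of the grid sits near (0, 0),
# the right half near (5, 5) in feature space.
grid = [[[0.0, 0.0] if c < 3 else [5.0, 5.0] for c in range(6)] for r in range(4)]
heights = u_matrix(grid)
```

In this toy grid the U-heights are zero inside each half and positive only along the column boundary between the halves, which is the "ridge" that a U-matrix display would show between two clusters. A map trained on structure-free data would be expected to show no such consistent ridge.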
Ward clustering imposed cluster structures on the "golf ball", "cuboid" and "S-shaped" data sets, which consisted of random data containing no cluster structure at all. Ward clustering also imposed structures on permuted real-world data sets. By contrast, the ESOM/U-matrix approach correctly found that these data contain no cluster structure. At the same time, ESOM/U-matrix correctly identified clusters in biomedical data that truly contained subgroups, and it was always correct in identifying the cluster structure in further canonical artificial data. Using intentionally simple data sets, it is shown that popular clustering algorithms typically used for biomedical data sets may fail to cluster data correctly, suggesting that they are also likely to perform erroneously on high-dimensional biomedical data.
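Why a hierarchical method can "find" clusters in structure-free data can be seen in a small sketch: cutting a Ward hierarchy at k clusters returns k groups whether or not the data contain any. The code below is a naive pure-Python agglomeration using Ward's merge cost, run on uniformly distributed random points. It does not reproduce the study's data sets or software; `ward_clusters`, the sample size and k = 3 are illustrative assumptions.

```python
import random

def ward_clusters(points, k):
    """Naive agglomerative clustering with Ward's merge criterion.

    Each cluster is tracked as (centroid, member indices). At every step the
    pair whose fusion increases the within-cluster sum of squares the least,
    n_a * n_b / (n_a + n_b) * ||c_a - c_b||^2, is merged until k remain.
    """
    clusters = [(list(p), [i]) for i, p in enumerate(points)]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            ca, ma = clusters[a]
            for b in range(a + 1, len(clusters)):
                cb, mb = clusters[b]
                sq = sum((x - y) ** 2 for x, y in zip(ca, cb))
                cost = len(ma) * len(mb) / (len(ma) + len(mb)) * sq
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        (ca, ma), (cb, mb) = clusters[a], clusters[b]
        n = len(ma) + len(mb)
        # Centroid of the merged cluster is the size-weighted mean.
        centroid = [(len(ma) * x + len(mb) * y) / n for x, y in zip(ca, cb)]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append((centroid, ma + mb))
    return [members for _, members in clusters]

rng = random.Random(0)
# Structure-free data: 60 points drawn uniformly from the unit square.
uniform_points = [(rng.random(), rng.random()) for _ in range(60)]
labels = ward_clusters(uniform_points, k=3)
sizes = sorted(len(members) for members in labels)
```

The call always yields three non-empty groups even though the points were generated without any cluster structure, so the partition alone is no evidence that subgroups exist. This is the failure mode the abstract attributes to Ward clustering, shown here only in its simplest form.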
The present analyses emphasized that generally established classical hierarchical clustering algorithms carry a considerable tendency to produce erroneous results. By contrast, unsupervised machine-learned analysis of cluster structures, applied using the ESOM/U-matrix method, is a viable, unbiased method to identify true clusters in the high-dimensional space of complex data.