Institute of Clinical Pharmacology, Goethe-University, Theodor-Stern-Kai 7, 60590 Frankfurt am Main, Germany.
Fraunhofer Institute for Molecular Biology and Applied Ecology IME, Project Group Translational Medicine and Pharmacology TMP, Theodor-Stern-Kai 7, 60590 Frankfurt am Main, Germany.
Int J Mol Sci. 2019 Dec 20;21(1):79. doi: 10.3390/ijms21010079.
Advances in flow cytometry enable the acquisition of large, high-dimensional data sets per patient. Novel computational techniques allow the visualization of structures in these data and, ultimately, the identification of relevant subgroups. Correct data visualizations and projections from the high-dimensional space onto the visualization plane require that the structures in the data be represented correctly. This work shows that frequently used techniques are unreliable in this respect. One of the most important methods for data projection in this area is t-distributed stochastic neighbor embedding (t-SNE). We analyzed its performance on artificial and real biomedical data sets. t-SNE introduced a cluster structure for homogeneously distributed data that did not contain any subgroup structure. In other data sets, t-SNE occasionally suggested the wrong number of subgroups or projected data points belonging to different subgroups as if they belonged to the same subgroup. As an alternative approach, emergent self-organizing maps (ESOM) were used in combination with U-matrix methods. This approach correctly identified homogeneous data, and in data sets containing distance-based or density-based subgroup structures, it correctly displayed the number of subgroups and the assignment of data points. The results highlight possible pitfalls in the use of a currently widely applied algorithmic technique for the detection of subgroups in high-dimensional cytometric data and suggest a robust alternative.
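To illustrate the U-matrix idea mentioned in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes an already trained map given as a small planar grid of weight vectors, and it takes the U-height of a neuron to be the mean distance to its four direct grid neighbors (one common convention; the sum of distances is also used). The function name `u_matrix` and the toy grid are hypothetical. ESOM maps are typically larger and often toroidal, which this sketch ignores.

```python
import math

def u_matrix(weights):
    """Return U-heights for a planar grid of weight vectors.

    weights[r][c] is the weight vector (list of floats) of the neuron
    at grid row r and column c. Assumption: the U-height is the mean
    Euclidean distance to the up/down/left/right neighbors.
    """
    rows, cols = len(weights), len(weights[0])
    heights = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            dists = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                # Planar (non-toroidal) grid: skip neighbors off the edge.
                if 0 <= nr < rows and 0 <= nc < cols:
                    dists.append(math.dist(weights[r][c], weights[nr][nc]))
            heights[r][c] = sum(dists) / len(dists) if dists else 0.0
    return heights

# Toy 2x4 map: the two left columns lie near the origin,
# the two right columns lie about 10 units away.
grid = [
    [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]],
    [[1.0, 0.0], [1.0, 1.0], [11.0, 0.0], [11.0, 1.0]],
]
heights = u_matrix(grid)
for row in heights:
    print([round(h, 2) for h in row])
```

In this toy grid, the neurons in the two middle columns receive larger U-heights than those in the outer columns, which is the kind of "wall" a U-matrix display uses to separate subgroups. A map trained on homogeneous data would be expected to show no such wall.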