Department of Chemistry, Stanford University, Stanford, California, United States of America.
Department of Mechanical Engineering, Stanford University, Stanford, California, United States of America.
PLoS One. 2019 Mar 13;14(3):e0212442. doi: 10.1371/journal.pone.0212442. eCollection 2019.
The clustering of data into physically meaningful subsets often requires assumptions regarding the number, size, or shape of the subgroups. Here, we present a new method, simultaneous coherent structure coloring (sCSC), which accomplishes the task of unsupervised clustering without a priori guidance regarding the underlying structure of the data. sCSC performs a sequence of binary splittings on the dataset such that the most dissimilar data points are required to be in separate clusters. To achieve this, we solve a generalized eigenvalue problem based on the pairwise dissimilarity between the data points to be clustered, which yields a set of orthogonal coordinates along which dissimilarity in the dataset is maximized. This sequence of bifurcations produces a binary tree representation of the system, from which the number of clusters in the data and their interrelationships naturally emerge. To illustrate the effectiveness of the method in the absence of a priori assumptions, we apply it to three exemplary problems in fluid dynamics. Then, we illustrate its capacity for interpretability using a high-dimensional protein folding simulation dataset. While we restrict our examples to dynamical physical systems in this work, we anticipate straightforward translation to other fields where existing analysis tools require ad hoc assumptions on the data structure or lack the interpretability of the present method, or where the underlying processes are less accessible, such as genomics and neuroscience.
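The idea of splitting along a coordinate of maximal dissimilarity can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the dissimilarity measure, the exact form of the eigenvalue problem, or how the tree terminates. The sketch assumes Euclidean distance as the pairwise dissimilarity, a graph-Laplacian formulation L x = λ D x whose largest-eigenvalue eigenvector is split by sign, and a hypothetical depth limit as the stopping rule. The function names binary_split and build_tree and the parameters max_depth and min_size are illustrative only.

```python
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import pdist, squareform


def binary_split(points):
    """Return a boolean mask splitting points into two groups, or None.

    Assumption: dissimilarity is Euclidean distance, and the split follows
    the sign of the top eigenvector of L x = lambda D x with L = D - A.
    """
    if len(points) < 2:
        return None
    # Pairwise dissimilarity matrix A (symmetric, zero diagonal).
    A = squareform(pdist(points))
    if A.max() == 0.0:
        return None  # all points identical: nothing to separate
    # Degree matrix D (row sums of A) and graph Laplacian L = D - A.
    D = np.diag(A.sum(axis=1))
    L = D - A
    # eigh solves the symmetric generalized problem and returns
    # eigenvalues in ascending order, so the last column is the
    # coordinate along which weighted dissimilarity is largest.
    _, vecs = eigh(L, D)
    return vecs[:, -1] >= 0


def build_tree(points, indices=None, depth=0, max_depth=3, min_size=2):
    """Recursively apply binary_split to build a binary tree of index sets.

    max_depth and min_size are hypothetical stopping rules for this sketch.
    """
    if indices is None:
        indices = np.arange(len(points))
    node = {"indices": indices, "children": []}
    if depth >= max_depth or len(indices) < min_size:
        return node
    mask = binary_split(points[indices])
    if mask is None:
        return node
    for part in (indices[mask], indices[~mask]):
        node["children"].append(
            build_tree(points, part, depth + 1, max_depth, min_size)
        )
    return node


if __name__ == "__main__":
    # Two well-separated groups of three points each.
    pts = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                    [10.0, 10.0], [10.1, 10.0], [10.0, 10.1]])
    tree = build_tree(pts, max_depth=1)
    for child in tree["children"]:
        print(child["indices"])
```

Because the depth limit is imposed by hand, this sketch does not reproduce the property described above that the number of clusters emerges from the tree itself; it only shows the mechanics of a single dissimilarity-maximizing split and its recursive application.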