Simultaneous coherent structure coloring facilitates interpretable clustering of scientific data by amplifying dissimilarity.

Affiliations

Department of Chemistry, Stanford University, Stanford, California, United States of America.

Department of Mechanical Engineering, Stanford University, Stanford, California, United States of America.

Publication information

PLoS One. 2019 Mar 13;14(3):e0212442. doi: 10.1371/journal.pone.0212442. eCollection 2019.

DOI:10.1371/journal.pone.0212442

PMID:30865644

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6415781/

Abstract

The clustering of data into physically meaningful subsets often requires assumptions regarding the number, size, or shape of the subgroups. Here, we present a new method, simultaneous coherent structure coloring (sCSC), which accomplishes the task of unsupervised clustering without a priori guidance regarding the underlying structure of the data. sCSC performs a sequence of binary splittings on the dataset such that the most dissimilar data points are required to be in separate clusters. To achieve this, we obtain a set of orthogonal coordinates along which dissimilarity in the dataset is maximized from a generalized eigenvalue problem based on the pairwise dissimilarity between the data points to be clustered. This sequence of bifurcations produces a binary tree representation of the system, from which the number of clusters in the data and their interrelationships naturally emerge. To illustrate the effectiveness of the method in the absence of a priori assumptions, we apply it to three exemplary problems in fluid dynamics. Then, we illustrate its capacity for interpretability using a high-dimensional protein folding simulation dataset. While we restrict our examples to dynamical physical systems in this work, we anticipate straightforward translation to other fields where existing analysis tools require ad hoc assumptions on the data structure, lack the interpretability of the present method, or in which the underlying processes are less accessible, such as genomics and neuroscience.
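The splitting procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract says only that the coordinates come from a generalized eigenvalue problem built on pairwise dissimilarity. The sketch assumes Euclidean distance as the dissimilarity, a graph-Laplacian form L x = λ D x, a split by the sign of the top eigenvector, and a hypothetical stopping rule controlled by `tol` and `max_depth`. All function and parameter names are illustrative.

```python
import numpy as np


def pairwise_dissimilarity(points):
    """Euclidean distance between every pair of rows (an assumed choice)."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def binary_split(A):
    """Split items in two using the top eigenvector of L x = lam D x.

    A is a symmetric, non-negative dissimilarity matrix with zero diagonal
    and strictly positive row sums. D is the diagonal matrix of row sums
    and L = D - A (the Laplacian form is an assumption of this sketch).
    """
    d = A.sum(axis=1)
    L = np.diag(d) - A
    # With D diagonal and positive, L x = lam D x is equivalent to the
    # symmetric problem (D^-1/2 L D^-1/2) y = lam y with x = D^-1/2 y.
    inv_sqrt_d = 1.0 / np.sqrt(d)
    M = inv_sqrt_d[:, None] * L * inv_sqrt_d[None, :]
    _, eigvecs = np.linalg.eigh(M)  # eigenvalues come back in ascending order
    x = inv_sqrt_d * eigvecs[:, -1]  # coordinate for the largest eigenvalue
    return x >= 0  # hypothetical rule: split by the sign of the coordinate


def split_tree(A, indices=None, tol=1.0, depth=0, max_depth=3):
    """Recursively bisect the data; leaves are lists of row indices."""
    if indices is None:
        indices = np.arange(A.shape[0])
    sub = A[np.ix_(indices, indices)]
    # Hypothetical stopping rule: stop once the group is tight or deep enough.
    if len(indices) < 2 or sub.max() <= tol or depth >= max_depth:
        return indices.tolist()
    mask = binary_split(sub)
    return [
        split_tree(A, indices[mask], tol, depth + 1, max_depth),
        split_tree(A, indices[~mask], tol, depth + 1, max_depth),
    ]


# Toy data: two well-separated groups of points on a line.
points = np.array([[0.0], [0.1], [0.2], [10.0], [10.1], [10.2]])
tree = split_tree(pairwise_dissimilarity(points))
print(tree)
```

On this toy input the two well-separated groups land in different leaves of the tree. The stopping rule, which the abstract does not specify, decides how deep the tree grows and therefore how many clusters are reported.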

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0ca6/6415781/0d457bac6673/pone.0212442.g001.jpg
