Ferrarotti Marco Jacopo, Rocchia Walter, Decherchi Sergio
IEEE Trans Neural Netw Learn Syst. 2019 Aug;30(8):2449-2462. doi: 10.1109/TNNLS.2018.2884792. Epub 2018 Dec 25.
In this paper, we introduce the concept of principal paths in data space. We show that finding them is a well-characterized problem from the point of view of cognition, and that it can yield salient insights into the analyzed data, enabling topological/holistic descriptions. These paths can be interpreted as local principal curves, and we suggest that they are analogous to what statistical mechanics calls minimum free-energy paths. Here, we move that concept from physics to data space and compute the paths in both the original and the kernel space. The algorithm is a regularized version of the well-known k-means clustering algorithm. The regularization parameter is selected via an in-sample model selection process based on Bayesian evidence maximization. We show that this choice of regularization parameter consistently leads to the same manifold even when the number of clusters changes. We apply the method to common data sets, dynamical systems, and, in particular, molecular dynamics trajectories, showing the generality and usefulness of the approach and its superiority over other related approaches.
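To make the idea of a regularized k-means concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not state the form of the regularizer, so the code assumes one: a penalty `s * sum ||w[i+1] - w[i]||^2` on consecutive centroids (waypoints) of a chain whose two endpoints are held fixed. The function name `regularized_kmeans_path`, its parameters, and the straight-line initialization are hypothetical, and the Bayesian evidence maximization used to select the regularization parameter is not reproduced here; `s` is simply passed in by hand.

```python
import numpy as np


def regularized_kmeans_path(X, start, end, n_waypoints=10, s=1.0, n_iter=50):
    """Sketch of k-means with an ASSUMED chain penalty between fixed endpoints.

    Minimizes  sum_i sum_{x in C_i} ||x - w_i||^2 + s * sum_i ||w_{i+1} - w_i||^2
    by alternating hard assignments with a linear solve for the waypoints.
    Requires s > 0 so that the linear system stays invertible.
    """
    X = np.asarray(X, dtype=float)
    start = np.asarray(start, dtype=float)
    end = np.asarray(end, dtype=float)
    n = n_waypoints

    # Initialize the interior waypoints on the segment joining the endpoints.
    t = np.linspace(0.0, 1.0, n + 2)[1:-1, None]
    W = (1.0 - t) * start + t * end

    # Tridiagonal matrix that arises from differentiating the chain penalty.
    L = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)

    for _ in range(n_iter):
        # Assignment step: each sample goes to its nearest waypoint.
        d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)

        # Per-cluster sample counts and coordinate sums.
        counts = np.bincount(labels, minlength=n)
        sums = np.zeros_like(W)
        np.add.at(sums, labels, X)

        # Update step: setting the gradient to zero gives
        # (diag(counts) + s * L) W = sums + s * (boundary terms).
        A = np.diag(counts.astype(float)) + s * L
        B = sums.copy()
        B[0] += s * start
        B[-1] += s * end
        W_new = np.linalg.solve(A, B)

        if np.allclose(W_new, W):
            W = W_new
            break
        W = W_new

    # Path including the fixed endpoints, plus labels from the last assignment.
    return np.vstack([start, W, end]), labels


# Usage on synthetic data: noisy samples along a half circle.
rng = np.random.default_rng(0)
theta = rng.uniform(0.0, np.pi, size=300)
X = np.column_stack([np.cos(theta), np.sin(theta)])
X += 0.05 * rng.normal(size=X.shape)
path, labels = regularized_kmeans_path(
    X, start=[1.0, 0.0], end=[-1.0, 0.0], n_waypoints=8, s=5.0
)
```

Under this assumed penalty, a very large `s` pulls the waypoints toward the straight segment between the endpoints, while a small `s` lets them behave like ordinary k-means centroids. This trade-off is why the choice of the regularization parameter matters.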