Department of Chemistry and Applied Biosciences, ETH Zürich, Vladimir-Prelog-Weg 2, 8093 Zürich, Switzerland.
J Chem Phys. 2023 Jul 14;159(2). doi: 10.1063/5.0148735.
Clustering has become an indispensable tool in the presence of increasingly large and complex datasets. Most clustering algorithms depend, either explicitly or implicitly, on the sampled density. However, estimated densities are fragile due to the curse of dimensionality and finite sampling effects, for instance, in molecular dynamics simulations. To avoid the dependence on estimated densities, an energy-based clustering (EBC) algorithm based on the Metropolis acceptance criterion is developed in this work. In the proposed formulation, EBC can be considered a generalization of spectral clustering in the limit of large temperatures. Taking the potential energy of a sample explicitly into account alleviates requirements regarding the distribution of the data. In addition, it permits the subsampling of densely sampled regions, which can result in significant speed-ups and sublinear scaling. The algorithm is validated on a range of test systems including molecular dynamics trajectories of alanine dipeptide and the Trp-cage miniprotein. Our results show that including information about the potential-energy surface can largely decouple clustering from the sampled density.
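As a rough illustration of how a Metropolis acceptance criterion can bring potential energies into a similarity-based clustering, the sketch below builds a row-stochastic transition matrix from a Gaussian proposal between samples and a Metropolis acceptance factor min(1, exp(-(E_j - E_i)/T)). This is not the EBC algorithm as published: the function name `metropolis_transition_matrix`, the Gaussian proposal, the `bandwidth` parameter, and the use of energies in units of k_B are all assumptions made for illustration. It only shows the limiting behaviour mentioned in the abstract. At very high temperature every proposed move is accepted, so the matrix reduces to the row-normalized Gaussian kernel (the random-walk matrix used in normalized spectral clustering).

```python
import numpy as np


def metropolis_transition_matrix(x, energy, temperature, bandwidth=1.0):
    """Hypothetical helper: Gaussian proposal combined with Metropolis acceptance.

    x           : (n,) or (n, d) array of sample coordinates (n >= 2)
    energy      : (n,) array of potential energies, in units of k_B
    temperature : temperature T in the acceptance criterion
    bandwidth   : width of the Gaussian proposal (illustrative assumption)
    """
    x = np.asarray(x, dtype=float)
    energy = np.asarray(energy, dtype=float)
    if x.ndim == 1:
        x = x[:, None]

    # Gaussian proposal weights between all pairs, row-normalized, no self-moves.
    sq_dist = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=-1)
    proposal = np.exp(-sq_dist / (2.0 * bandwidth ** 2))
    np.fill_diagonal(proposal, 0.0)
    proposal /= proposal.sum(axis=1, keepdims=True)

    # Metropolis acceptance for a move i -> j: min(1, exp(-(E_j - E_i) / T)).
    delta = energy[None, :] - energy[:, None]
    accept = np.exp(-np.maximum(delta, 0.0) / temperature)

    transition = proposal * accept
    # Rejected proposals leave the walker where it is.
    transition[np.diag_indices_from(transition)] = 1.0 - transition.sum(axis=1)
    return transition


# Toy double-well potential sampled on a uniform grid (assumed test system).
x = np.linspace(-2.0, 2.0, 21)
e = (x ** 2 - 1.0) ** 2
P_cold = metropolis_transition_matrix(x, e, temperature=0.1, bandwidth=0.3)
P_hot = metropolis_transition_matrix(x, e, temperature=1e9, bandwidth=0.3)
```

In this toy setup, `P_hot` is essentially the plain row-normalized Gaussian kernel, while `P_cold` suppresses uphill moves out of the two wells. The leading eigenvectors of either matrix could then be passed to a standard spectral-clustering step. Because the energies enter explicitly, the grid does not need to follow the Boltzmann density, which is the intuition behind decoupling the clustering from the sampled density.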