Department of Chemistry, Colorado State University, Fort Collins, Colorado 80523, United States.
Department of Chemistry, New York University, New York, New York 10003, United States.
J Chem Theory Comput. 2022 May 10;18(5):3218-3230. doi: 10.1021/acs.jctc.1c01290. Epub 2022 Apr 28.
Determining the optimal number and identity of structural clusters from an ensemble of molecular configurations continues to be a challenge. Recent structural clustering methods have focused on internal coordinates because these features are innately invariant to rotation and translation. The vast number of possible internal coordinates necessitates a feature space supervision step to make clustering tractable, but this yields a protocol that can be specific to the type of system. Particle positions offer an appealing alternative to internal coordinates but suffer from a lack of rotational and translational invariance, as well as a perceived insensitivity to regions of structural dissimilarity. Here, we present a method, denoted shape-GMM, that overcomes the shortcomings of particle positions using a weighted maximum likelihood alignment procedure. This alignment strategy is then built into an expectation maximization Gaussian mixture model (GMM) procedure to capture metastable states in the free-energy landscape. The resulting algorithm distinguishes between a variety of different structures, including those indistinguishable by root-mean-square displacement and pairwise distances, as demonstrated on several model systems. Shape-GMM results on an extensive simulation of the fast-folding HP35 Nle/Nle mutant protein support a four-state folding/unfolding mechanism, which is consistent with previous experimental results and provides kinetic details comparable to previous state-of-the-art clustering approaches, as measured by the VAMP-2 score. Currently, training of shape-GMMs is recommended for systems (or subsystems) that can be represented by ≲200 particles and ≲100k configurations, so that high-dimensional covariance matrices can be estimated while keeping computational expense in balance. Once a shape-GMM is trained, it can be used to predict the cluster identities of millions of configurations.
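The invariance problem with raw particle positions can be illustrated with a minimal sketch. It is not the weighted maximum likelihood alignment of shape-GMM: it assumes two-dimensional configurations, uniform particle weights, and an ordinary least-squares fit to a single reference, and all function names are hypothetical. It shows only that the same shape, once rotated and translated, looks very different in raw coordinates and identical after alignment.

```python
import math

def center(points):
    """Subtract the centroid, removing the translational degrees of freedom."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    return [(x - cx, y - cy) for x, y in points]

def rotate(points, theta):
    """Rotate 2D points counterclockwise by theta radians about the origin."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y, s * x + c * y) for x, y in points]

def optimal_angle(mobile, reference):
    """Least-squares rotation angle taking centered `mobile` onto centered `reference`.

    In 2D the overlap sum(m'.r) equals cos(theta)*dot + sin(theta)*cross,
    which is maximized at theta = atan2(cross, dot).
    """
    dot = sum(mx * rx + my * ry for (mx, my), (rx, ry) in zip(mobile, reference))
    cross = sum(mx * ry - my * rx for (mx, my), (rx, ry) in zip(mobile, reference))
    return math.atan2(cross, dot)

def rmsd(a, b):
    """Root-mean-square deviation between two equally ordered point sets."""
    total = sum((ax - bx) ** 2 + (ay - by) ** 2 for (ax, ay), (bx, by) in zip(a, b))
    return math.sqrt(total / len(a))

def aligned_rmsd(mobile, reference):
    """RMSD after removing translation and the optimal uniform-weight rotation."""
    m, r = center(mobile), center(reference)
    return rmsd(rotate(m, optimal_angle(m, r)), r)

# A three-particle shape and a rotated, translated copy of it.
reference = [(0.0, 0.0), (1.0, 0.0), (1.0, 2.0)]
moved = [(x + 5.0, y - 3.0) for x, y in rotate(reference, 0.7)]

raw = rmsd(moved, reference)          # large: dominated by rigid-body motion
fit = aligned_rmsd(moved, reference)  # essentially zero: the shapes are identical
```

In a mixture-model setting, an alignment of this kind would be repeated for each cluster's mean structure inside the expectation maximization loop; the per-particle weighting that the abstract describes is what this uniform-weight sketch leaves out.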