Department of Chemistry, University of Konstanz, Konstanz, Germany.
Theory Department, Max Planck Institute for Polymer Research, Mainz, Germany.
J Chem Phys. 2023 Apr 14;158(14):144109. doi: 10.1063/5.0142797.
We present an unsupervised data processing workflow that is specifically designed to obtain a fast conformational clustering of long molecular dynamics simulation trajectories. In this approach, we combine two dimensionality reduction algorithms (cc_analysis and encodermap) with a density-based spatial clustering algorithm (hierarchical density-based spatial clustering of applications with noise, HDBSCAN). The proposed scheme benefits from the strengths of the three algorithms while avoiding most of the drawbacks of the individual methods. Here, the cc_analysis algorithm is applied for the first time to molecular simulation data. The encodermap algorithm complements cc_analysis by providing an efficient way to process large amounts of data and assign them to clusters. The main goal of the procedure is to maximize the number of assigned frames of a given trajectory while keeping a clear conformational identity of the clusters that are found. In practice, we achieve this by using an iterative clustering approach and a tunable root-mean-square-deviation-based criterion in the final cluster assignment. This allows us to find clusters of different densities and different degrees of structural identity. We illustrate the capability and performance of this clustering workflow on four protein systems: the wild type and a thermostable mutant of the Trp-cage protein (TC5b and TC10b, respectively), NTL9, and Protein B. Each of these test systems poses its own challenges to the scheme; taken together, they give an overview of the advantages and potential difficulties that can arise when using the proposed method.
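To make the idea of a tunable RMSD-based criterion in the final cluster assignment more concrete, the following is a minimal Python sketch and not the authors' implementation. It assumes that all frames are already superimposed on a common reference, that each cluster is represented by a single reference conformation, and that a frame is assigned to the closest cluster only if its RMSD to that reference is within a user-chosen cutoff. The names `rmsd`, `assign_by_rmsd`, `NOISE`, the toy coordinates, and the cutoff values are all hypothetical and chosen for illustration only.

```python
import math

NOISE = -1  # hypothetical label for frames that match no cluster


def rmsd(a, b):
    """RMSD between two conformations given as lists of (x, y, z) tuples.

    Assumption: both structures are already superimposed, so no
    rotational or translational fitting is performed here.
    """
    if len(a) != len(b):
        raise ValueError("conformations must have the same number of atoms")
    squared = sum((p - q) ** 2 for u, v in zip(a, b) for p, q in zip(u, v))
    return math.sqrt(squared / len(a))


def assign_by_rmsd(frames, references, cutoff):
    """Assign each frame to its closest cluster reference.

    A frame keeps the NOISE label unless its RMSD to the closest
    reference is less than or equal to `cutoff`.
    """
    labels = []
    for frame in frames:
        best_label, best_dist = NOISE, float("inf")
        for label, ref in references.items():
            d = rmsd(frame, ref)
            if d <= cutoff and d < best_dist:
                best_label, best_dist = label, d
        labels.append(best_label)
    return labels


# Toy two-atom "conformations" (arbitrary length units).
references = {
    0: [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    1: [(0.0, 0.0, 0.0), (0.0, 3.0, 0.0)],
}
frames = [
    [(0.0, 0.0, 0.0), (1.1, 0.0, 0.0)],  # close to reference 0
    [(0.0, 0.0, 0.0), (0.0, 2.8, 0.0)],  # close to reference 1
    [(0.0, 0.0, 0.0), (2.0, 1.0, 0.0)],  # between the two references
]

tight = assign_by_rmsd(frames, references, cutoff=0.5)
loose = assign_by_rmsd(frames, references, cutoff=1.5)
print(tight, loose)
```

In this sketch, a small cutoff leaves the third frame unassigned, while a larger cutoff assigns it to cluster 0. This mirrors the trade-off stated in the abstract between maximizing the number of assigned frames and keeping a clear conformational identity for each cluster.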