Chen Lexin, Brylle Woody Santos Jherome, Gaza Jokent, Perez Alberto, Miranda-Quintana Ramón Alain
Department of Chemistry and Quantum Theory Project, University of Florida, Gainesville 32611, Florida, United States.
J Chem Inf Model. 2025 Jun 23;65(12):6209-6220. doi: 10.1021/acs.jcim.5c00539. Epub 2025 Jun 2.
Clustering remains a key tool in the analysis of molecular dynamics (MD) simulations, from the preparation of kinetic models to the study of mechanistic pathways and structural determination. It is no surprise then that multiple algorithms are currently used in the MD community, with k-means and hierarchical methods being arguably the two most popular. The former is very attractive from a purely computational point of view, demanding minimal memory and time resources, but at the price of being able to partition the data only in very restrictive ways. Hierarchical strategies, on the other hand, can generate arbitrary partitions, but with steep memory and time requirements due to their need to build a pairwise distance matrix for all the considered conformations/frames. Here we propose a new hybrid paradigm, the hierarchical extended linkage method (HELM), that retains the efficiency of k-means while incorporating the flexibility of hierarchical methods. The key ingredient is the use of n-ary difference functions as a way to stabilize the k-means results and efficiently build the hierarchy of subsets. We showcase the applicability of this strategy over protein-DNA and protein folding studies, including the complete analysis of simulations with over 1.5 million frames. HELM is freely available in our MDANCE clustering package.
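The hybrid idea described in the abstract (partition with k-means first, then build a hierarchy over the resulting clusters without a pairwise distance matrix) can be illustrated with a minimal sketch. This is not the MDANCE API and not necessarily the authors' exact procedure. It assumes, for illustration only, that the mean squared deviation (MSD) of a set from its centroid serves as the n-ary difference function, and that clusters are merged greedily by the lowest MSD of their union. The names `kmeans`, `msd`, and `merge_hierarchy` are hypothetical.

```python
import numpy as np


def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means; returns one integer label per row of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared distance of every point to every center: shape (N, k).
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return labels


def msd(n, lin_sum, sq_sum):
    """Assumed n-ary difference: mean squared deviation from the centroid,
    computed from the point count, per-dimension sum, and sum of squares."""
    return float((sq_sum / n - (lin_sum / n) ** 2).sum())


def merge_hierarchy(X, labels):
    """Greedily merge the pair of clusters whose union has the lowest MSD.

    Each cluster is summarized by (count, sum, sum of squares), so scoring
    a candidate merge never touches the individual frames again.
    Returns a list of (id_a, id_b, new_id, msd_of_union) tuples.
    """
    stats = {}
    for c in np.unique(labels):
        pts = X[labels == c]
        stats[int(c)] = (len(pts), pts.sum(axis=0), (pts ** 2).sum(axis=0))
    merges = []
    next_id = max(stats) + 1
    while len(stats) > 1:
        ids = sorted(stats)
        best = None
        for pos, a in enumerate(ids):
            for b in ids[pos + 1:]:
                # Summaries of a union are just the element-wise sums.
                union = tuple(x + y for x, y in zip(stats[a], stats[b]))
                score = msd(*union)
                if best is None or score < best[0]:
                    best = (score, a, b, union)
        score, a, b, union = best
        del stats[a], stats[b]
        stats[next_id] = union
        merges.append((a, b, next_id, score))
        next_id += 1
    return merges


# Toy data: three well-separated 2D blobs, deliberately over-partitioned.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 2))
               for c in (0.0, 5.0, 10.0)])
labels = kmeans(X, k=6)
merges = merge_hierarchy(X, labels)
for a, b, new_id, score in merges:
    print(f"merge {a} + {b} -> {new_id} (MSD of union = {score:.3f})")
```

In this sketch the hierarchy is built over k cluster summaries rather than over all frames, so the cost of the merging stage depends on k and the number of features, not on the number of frames. Real MD workflows would typically align the frames and use an optimized k-means implementation first.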