Chen Lexin, Roe Daniel R, Kochert Matthew, Simmerling Carlos, Miranda-Quintana Ramón Alain
Department of Chemistry, University of Florida, FL, USA.
Quantum Theory Project, University of Florida, FL, USA.
bioRxiv. 2024 Mar 8:2024.03.07.583975. doi: 10.1101/2024.03.07.583975.
One of the key challenges of k-means clustering is seed selection, that is, the initial centroid estimation, since the clustering result depends heavily on this choice. Alternatives such as k-means++ have mitigated this limitation by estimating the centroids using an empirical probability distribution. However, with high-dimensional and complex datasets such as those obtained from molecular simulation, k-means++ fails to partition the data in an optimal manner. Furthermore, the stochastic elements in all flavors of k-means++ lead to a lack of reproducibility. k-means N-Ary Natural Initiation (NANI) is presented as an alternative that tackles this challenge by using efficient n-ary comparisons both to identify high-density regions in the data and to select a diverse set of initial conformations. Centroids generated by NANI are representative of the data and different from one another, which helps k-means partition the data accurately, and they are also deterministic, providing consistent cluster populations across replicates. In peptide and protein folding molecular simulations, NANI created compact and well-separated clusters and accurately found metastable states that agree with the literature. NANI can cluster diverse datasets and can be used as a standalone tool or as part of our MDANCE clustering package.
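The idea of deterministic seeding that combines density and diversity can be illustrated with a minimal sketch. This is not the NANI implementation and does not use the MDANCE API: the n-ary comparisons are replaced here by a plain mean squared distance as a density proxy, followed by a greedy max-min diversity pick. The names `deterministic_seeds`, `kmeans`, and `dense_fraction` are hypothetical and introduced only for illustration.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def deterministic_seeds(points, k, dense_fraction=1.0):
    """Pick k seeds with no random step (simplified stand-in, NOT NANI itself).

    Assumption: a low mean squared distance to all points is used as a crude
    density proxy; the paper uses n-ary comparisons instead.
    """
    n = len(points)
    scores = [sum(sq_dist(p, q) for q in points) / n for p in points]
    # Sort by score, breaking ties by index so the order is reproducible.
    order = sorted(range(n), key=lambda i: (scores[i], i))
    # Keep only the densest fraction of points as seed candidates.
    n_dense = max(k, math.ceil(dense_fraction * n))
    candidates = order[:n_dense]
    seeds = [candidates[0]]  # start from the densest point
    while len(seeds) < k:
        # Max-min diversity: take the candidate farthest from its nearest seed.
        best = max(
            (c for c in candidates if c not in seeds),
            key=lambda c: (min(sq_dist(points[c], points[s]) for s in seeds), -c),
        )
        seeds.append(best)
    return [tuple(points[i]) for i in seeds]


def kmeans(points, centroids, n_iter=20):
    """Plain Lloyd iterations starting from the given centroids."""
    centroids = list(centroids)
    labels = []
    for _ in range(n_iter):
        labels = [
            min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points
        ]
        new = []
        for j, c in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # Keep the old centroid if a cluster ends up empty.
            new.append(
                tuple(sum(col) / len(members) for col in zip(*members))
                if members else c
            )
        if new == centroids:
            break
        centroids = new
    return labels, centroids


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3),
            (5.0, 5.0), (5.1, 4.9), (4.8, 5.2),
            (9.0, 0.0), (9.2, 0.1), (8.9, -0.2)]
    seeds = deterministic_seeds(data, k=3)
    labels, _ = kmeans(data, seeds)
    print(seeds, labels)
```

Because every step is a sort or an argmax with a fixed tie-break, repeated runs on the same data return the same seeds and therefore the same cluster populations. In this sketch, lowering `dense_fraction` restricts the candidates to points near the global center of the data, which can exclude entire outlying clusters, so the value is a trade-off rather than a recommended setting.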