k-Means NANI: an improved clustering algorithm for Molecular Dynamics simulations.

Authors

Chen Lexin, Roe Daniel R, Kochert Matthew, Simmerling Carlos, Miranda-Quintana Ramón Alain

Affiliations

Department of Chemistry, University of Florida, FL, USA.

Quantum Theory Project, University of Florida, FL, USA.

Publication information

bioRxiv. 2024 Mar 8:2024.03.07.583975. doi: 10.1101/2024.03.07.583975.

DOI:10.1101/2024.03.07.583975

PMID:38496504

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC10942464/

Abstract

One of the key challenges of k-means clustering is the seed selection or the initial centroid estimation, since the clustering result depends heavily on this choice. Alternatives such as k-means++ have mitigated this limitation by estimating the centroids using an empirical probability distribution. However, with high-dimensional and complex datasets such as those obtained from molecular simulation, k-means++ fails to partition the data in an optimal manner. Furthermore, stochastic elements in all flavors of k-means++ will lead to a lack of reproducibility. k-means N-Ary Natural Initiation (NANI) is presented as an alternative to tackle this challenge by using efficient n-ary comparisons to both identify high-density regions in the data and select a diverse set of initial conformations. Centroids generated from NANI are not only representative of the data and different from one another, helping k-means to partition the data accurately, but also deterministic, providing consistent cluster populations across replicates. From peptide and protein folding molecular simulations, NANI was able to create compact and well-separated clusters as well as accurately find the metastable states that agree with the literature. NANI can cluster diverse datasets and be used as a standalone tool or as part of our MDANCE clustering package.
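To make the role of seed selection concrete, the following is a minimal sketch of a deterministic, diversity-driven initialization followed by standard k-means iterations. It is not the NANI algorithm and does not use the MDANCE package: the abstract does not give enough detail to reproduce the n-ary comparisons, so the sketch swaps in a simpler stand-in (first seed closest to the global mean, further seeds chosen by max-min distance). All function names and the toy data are hypothetical and exist only for illustration.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def deterministic_seeds(points, k):
    """Pick k seeds without any random numbers.

    Assumption for illustration only: the first seed is the point closest
    to the global mean, and each further seed is the point farthest from
    its nearest already-chosen seed (max-min diversity). NANI itself relies
    on n-ary comparisons, which are not reproduced here.
    """
    center = mean_point(points)
    first = min(range(len(points)), key=lambda i: sq_dist(points[i], center))
    chosen = [first]
    while len(chosen) < k:
        candidates = (i for i in range(len(points)) if i not in chosen)
        nxt = max(
            candidates,
            key=lambda i: min(sq_dist(points[i], points[j]) for j in chosen),
        )
        chosen.append(nxt)
    return [points[i] for i in chosen]


def k_means(points, seeds, max_iter=100):
    """Standard Lloyd iterations starting from the given seeds."""
    centroids = [tuple(s) for s in seeds]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = mean_point(members)
    return labels, centroids


# Toy 2D data with three well-separated groups (hypothetical values).
data = [
    (0.0, 0.0), (0.2, 0.1), (0.1, 0.3),
    (5.0, 5.0), (5.2, 4.9), (4.9, 5.1),
    (10.0, 0.0), (10.1, 0.2), (9.8, 0.1),
]
seeds = deterministic_seeds(data, 3)
labels, centroids = k_means(data, seeds)
```

Because nothing in the seeding step is random, repeated runs on the same data return the same seeds and therefore the same cluster populations, which is the reproducibility property the abstract attributes to deterministic initialization. With a stochastic scheme such as k-means++, the seeds (and possibly the final partition) can change from run to run unless the random state is fixed.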
