Ribeiro Rafael Bicudo, Cezar Henrique Musseli
Institute of Physics, University of São Paulo, Rua do Matão 1731, 05508-090 São Paulo, São Paulo, Brazil.
Hylleraas Centre for Quantum Molecular Sciences and Department of Chemistry, University of Oslo, PO Box 1033 Blindern, 0315 Oslo, Norway.
J Chem Theory Comput. 2025 Jul 22;21(14):6759-6768. doi: 10.1021/acs.jctc.5c00634. Epub 2025 Jul 3.
Clustering techniques are well established as a powerful strategy for analyzing the large volumes of data generated by molecular modeling. Several tools have been developed to cluster configurations from classical simulations, but they typically focus on individual units, ranging from small molecules to complex proteins. Because the standard approach relies on the root-mean-square deviation (RMSD) of atomic positions, accounting for permutations between atoms is crucial for the clustering procedure when identical molecules are present. To address this issue, we present clusttraj, a solvent-informed clustering package that corrects inflated RMSD values by finding the optimal pairing between configurations. The program combines reordering schemes with the Kabsch algorithm to minimize the RMSD between molecular configurations before running a hierarchical clustering protocol. Using evaluation metrics, the ideal clustering threshold can be determined automatically and the available linkage schemes can be compared. The program's capabilities are illustrated with solute-solvent systems ranging from pure water clusters to a solvated protein and a small solute in different solvents. We investigate the dependence of the results on parameters such as system size and reordering method, and assess how representative the cluster medoids are for characterizing optical properties. clusttraj is implemented as a Python library and can be used to cluster generic ensembles of molecular configurations beyond solute-solvent systems.
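The workflow described above (reorder identical atoms, superimpose with the Kabsch algorithm, then cluster hierarchically on the resulting RMSD matrix) can be sketched with NumPy and SciPy. This is a minimal illustration and not the clusttraj API: the function names `kabsch_rmsd`, `reorder`, and `cluster` are hypothetical, the reordering shown is a per-element Hungarian assignment chosen here as one possible scheme, and the configurations are assumed to be roughly pre-aligned (for example, on the solute) before reordering.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist, squareform


def kabsch_rmsd(P, Q):
    """RMSD between two (N, 3) coordinate sets after optimal superposition."""
    P = P - P.mean(axis=0)
    Q = Q - Q.mean(axis=0)
    U, _, Vt = np.linalg.svd(P.T @ Q)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = np.sign(np.linalg.det(U @ Vt))
    R = U @ np.diag([1.0, 1.0, d]) @ Vt  # P @ R best matches Q
    return np.sqrt(np.mean(np.sum((P @ R - Q) ** 2, axis=1)))


def reorder(P, Q, labels):
    """Permute atoms of Q, element by element, to best match P.

    Assumption: P and Q are already roughly aligned, so a distance-based
    assignment is meaningful.
    """
    Pc = P - P.mean(axis=0)
    Qc = Q - Q.mean(axis=0)
    order = np.arange(len(labels))
    for element in np.unique(labels):
        idx = np.where(labels == element)[0]
        cost = cdist(Pc[idx], Qc[idx], metric="sqeuclidean")
        _, cols = linear_sum_assignment(cost)  # Hungarian-type assignment
        order[idx] = idx[cols]
    return Q[order]


def cluster(configs, labels, threshold, method="average"):
    """Hierarchical clustering on the permutation-corrected RMSD matrix."""
    n = len(configs)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            Qr = reorder(configs[i], configs[j], labels)
            dist[i, j] = dist[j, i] = kabsch_rmsd(configs[i], Qr)
    Z = linkage(squareform(dist), method=method)
    return fcluster(Z, t=threshold, criterion="distance"), dist
```

Calling `cluster(configs, labels, threshold)` with a list of `(N, 3)` arrays and a NumPy array of element symbols returns one cluster index per configuration together with the RMSD matrix. The `threshold` and `method` arguments stand in for the RMSD cutoff and linkage scheme mentioned in the abstract; choosing them automatically from evaluation metrics is not shown here.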