Kapp-Joswig Jan-Oliver, Keller Bettina G
Department of Theoretical Chemistry, Freie Universität Berlin, Arnimallee 22, 14195 Berlin, Germany.
J Chem Inf Model. 2023 Feb 27;63(4):1093-1098. doi: 10.1021/acs.jcim.2c01493. Epub 2023 Feb 6.
Density-based clustering procedures are widely used in a variety of data science applications. Their advantage lies in their ability to find clusters of arbitrary shape and size and in their robustness to outliers. In particular, they have proved effective in the analysis of molecular dynamics simulations, where they serve to identify relevant, low-energy molecular conformations. As such, they can provide a convenient basis for the construction of kinetic (core-set) Markov-state models. Here we present the open-source Python project CommonNNClustering, which provides an easy-to-use and efficient reimplementation of the common-nearest-neighbor (CommonNN) method. The package provides functionality for hierarchical clustering and for evaluating the results. We emphasize a generic API design to keep the implementation flexible and open to customization.
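To make the common-nearest-neighbor idea concrete, the following is a minimal, illustrative sketch in plain Python. It is not the CommonNNClustering package's API: the function name `commonnn_cluster` and its parameters `radius` and `min_common` are hypothetical. The sketch assumes the usual formulation of the criterion, in which two points are assigned to the same cluster if they lie within `radius` of each other and share at least `min_common` neighbors within that radius; clusters are then the connected components of this relation. Labeling unclustered points as 0 is a convention chosen for this example.

```python
from math import dist  # Euclidean distance, Python 3.8+


def commonnn_cluster(points, radius, min_common):
    """Toy CommonNN-style clustering (hypothetical helper, not the package API).

    Returns one integer label per point; 0 marks noise (assumed convention).
    """
    n = len(points)

    # Neighborhood of each point: indices of all *other* points within `radius`.
    neighbors = [
        {j for j in range(n) if j != i and dist(points[i], points[j]) <= radius}
        for i in range(n)
    ]

    # Union-find structure to collect connected components.
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Assumed density criterion: i and j are neighbors of each other and
    # share at least `min_common` neighbors.
    for i in range(n):
        for j in neighbors[i]:
            if j > i and len(neighbors[i] & neighbors[j]) >= min_common:
                parent[find(i)] = find(j)

    # Components of size one are noise; the rest are numbered from 1
    # in order of first appearance.
    roots = [find(i) for i in range(n)]
    sizes = {}
    for r in roots:
        sizes[r] = sizes.get(r, 0) + 1

    cluster_ids = {}
    labels = []
    for r in roots:
        if sizes[r] == 1:
            labels.append(0)
        else:
            labels.append(cluster_ids.setdefault(r, len(cluster_ids) + 1))
    return labels


# Two compact groups of four points each, plus one isolated point.
points = [
    (0, 0), (0, 1), (1, 0), (1, 1),
    (10, 10), (10, 11), (11, 10), (11, 11),
    (5, 5),
]
labels = commonnn_cluster(points, radius=1.5, min_common=2)
print(labels)  # [1, 1, 1, 1, 2, 2, 2, 2, 0]
```

This brute-force version computes all pairwise distances and is meant only to illustrate the criterion; raising `min_common` demands a higher local density, so that with `min_common=3` every point in this toy data set is labeled as noise.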