Laboratory of Bioactive Compounds Research and Development (LIDeB), Department of Biological Sciences, Faculty of Exact Sciences, National University of La Plata (UNLP), La Plata B1900ADU, Argentina.
J Chem Inf Model. 2022 Jun 27;62(12):2987-2998. doi: 10.1021/acs.jcim.2c00265. Epub 2022 Jun 10.
The clustering of small molecules involves organizing a set of chemical structures into smaller subgroups with similar features. Clustering is widely used to sample chemical datasets or libraries in a representative manner (e.g., to choose, from a virtual screening hit list, a chemically diverse subset of compounds for experimental confirmation, or to split datasets into representative training and validation sets when building machine learning models). Most strategies for clustering molecules are based on molecular fingerprints and hierarchical clustering algorithms. Here, two open-source in-house methodologies for clustering small molecules are presented: iterative Random subspace Principal Component Analysis clustering (iRaPCA), an iterative approach based on feature bagging, dimensionality reduction, and K-means optimization; and Silhouette Optimized Molecular Clustering (SOMoC), which combines molecular fingerprints with Uniform Manifold Approximation and Projection (UMAP) and the Gaussian Mixture Model (GMM) algorithm. In a benchmarking exercise, the performance of both clustering methods was examined across 29 datasets containing between 100 and 5000 small molecules, and the results were compared with those of two other well-known clustering methods, Ward and Butina. iRaPCA and SOMoC consistently showed the best performance across these 29 datasets, in terms of both within-cluster and between-cluster distances. Both iRaPCA and SOMoC have been implemented as free Web Apps and standalone applications, making them available to a wide audience within the scientific community.
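The comparison in terms of within-cluster and between-cluster distances can be illustrated with a silhouette coefficient, the quantity that SOMoC's name suggests it optimizes. The following is a minimal sketch and not the authors' implementation. It assumes fingerprints are represented as sets of on-bit indices and that the distance is one minus the Tanimoto similarity. The function names and the toy fingerprints are hypothetical, and the abstract does not state which exact metrics were used in the benchmark.

```python
from statistics import mean


def tanimoto_distance(fp_a, fp_b):
    """Return 1 - Tanimoto similarity for two fingerprints given as sets of on-bits."""
    union = len(fp_a | fp_b)
    if union == 0:
        return 0.0  # two empty fingerprints are treated as identical
    return 1.0 - len(fp_a & fp_b) / union


def silhouette_scores(fingerprints, labels):
    """Per-molecule silhouette s = (b - a) / max(a, b).

    a: mean distance to the other members of the molecule's own cluster
    b: smallest mean distance to the members of any other cluster
    """
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        a = mean(tanimoto_distance(fingerprints[i], fingerprints[j]) for j in own)
        b = min(
            mean(tanimoto_distance(fingerprints[i], fingerprints[j]) for j in members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append(0.0 if denom == 0 else (b - a) / denom)
    return scores


def mean_silhouette(fingerprints, labels):
    """Average silhouette over all molecules; higher means tighter, better separated clusters."""
    return mean(silhouette_scores(fingerprints, labels))


# Hypothetical toy fingerprints: two chemotypes with no shared bits.
toy_fps = [
    {1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 4, 5},
    {10, 11, 12, 13}, {10, 11, 12, 14}, {10, 11, 13, 14},
]
labels_by_chemotype = [0, 0, 0, 1, 1, 1]  # groups similar molecules together
labels_mixed = [0, 1, 0, 1, 0, 1]         # mixes the two chemotypes

if __name__ == "__main__":
    print("by chemotype:", mean_silhouette(toy_fps, labels_by_chemotype))
    print("mixed:       ", mean_silhouette(toy_fps, labels_mixed))
```

A labeling that keeps similar molecules together scores higher than one that mixes chemotypes, which is the sense in which a clustering can be ranked by within-cluster compactness and between-cluster separation.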