Cancer Research Center of Marseille (INSERM U1068, Institut Paoli-Calmettes, Aix-Marseille Université UM105, CNRS UMR7258), 13009 Marseille, France.
Department of Bioengineering, Imperial College London, London SW7 2AZ, UK.
Biomolecules. 2023 Mar 8;13(3):498. doi: 10.3390/biom13030498.
Machine learning-based models are widely used in the early drug-design pipeline. These models are validated with cross-validation strategies, including strategies that cluster molecules by chemical structure. However, poor clustering of compounds compromises such validation, especially for test molecules that are dissimilar to those in the training set. This study aims to find the best way to cluster the molecules screened by the National Cancer Institute (NCI)-60 project by comparing hierarchical, Taylor-Butina, and uniform manifold approximation and projection (UMAP) clustering methods. The best-performing algorithm can then be used to generate clusters for model validation strategies. The study also measures the impact of removing outlier molecules before the clustering step. Clustering results are evaluated using three well-known clustering quality metrics. In addition, we compute an average similarity matrix to assess the quality of each cluster. The results show that clustering quality varies from method to method. The clusters obtained by the hierarchical and Taylor-Butina methods are more computationally expensive to use in cross-validation strategies, and both methods cluster the molecules poorly. In contrast, the UMAP method provides the best clustering quality, and we therefore recommend it for analyzing this highly valuable dataset.
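As an illustration of one of the compared methods, the following is a minimal sketch of Taylor-Butina clustering followed by a mean within-cluster similarity check. It is not the study's implementation: the toy fingerprints, the 0.5 similarity threshold, and the function names `tanimoto`, `butina_clusters`, and `mean_intra_similarity` are all assumptions made for this example, and the fingerprints are represented simply as sets of on-bit indices.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 1.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def butina_clusters(fps, sim_threshold=0.6):
    """Taylor-Butina clustering (sketch; threshold value is an assumption).

    1. Build a neighbor list per molecule (similarity >= sim_threshold).
    2. Visit molecules in order of decreasing neighbor count.
    3. Each unassigned molecule becomes a centroid and claims its
       still-unassigned neighbors; molecules left alone form singletons.
    """
    n = len(fps)
    neighbors = [set() for _ in range(n)]
    for i, j in combinations(range(n), 2):
        if tanimoto(fps[i], fps[j]) >= sim_threshold:
            neighbors[i].add(j)
            neighbors[j].add(i)

    # Ties in neighbor count are broken by the lower index.
    order = sorted(range(n), key=lambda i: (-len(neighbors[i]), i))
    assigned = set()
    clusters = []
    for centroid in order:
        if centroid in assigned:
            continue
        members = [centroid] + sorted(neighbors[centroid] - assigned)
        assigned.update(members)
        clusters.append(members)
    return clusters


def mean_intra_similarity(fps, cluster):
    """Average pairwise Tanimoto similarity inside one cluster (1.0 for singletons)."""
    pairs = list(combinations(cluster, 2))
    if not pairs:
        return 1.0
    return sum(tanimoto(fps[i], fps[j]) for i, j in pairs) / len(pairs)


# Hypothetical toy fingerprints, not NCI-60 data.
toy_fps = [
    {1, 2, 3, 4},
    {1, 2, 3, 5},
    {1, 2, 3, 4, 5},
    {10, 11, 12},
    {10, 11, 13},
    {20, 21},
]

clusters = butina_clusters(toy_fps, sim_threshold=0.5)
for members in clusters:
    print(members, round(mean_intra_similarity(toy_fps, members), 3))
```

In this sketch the last toy molecule shares no bits with any other and ends up as a singleton, which is the kind of outlier molecule whose removal before clustering the abstract refers to. A real analysis would typically compute fingerprints with a cheminformatics toolkit rather than hand-written bit sets.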