iRaPCA and SOMoC: Development and Validation of Web Applications for New Approaches for the Clustering of Small Molecules.

Affiliation

Laboratory of Bioactive Compounds Research and Development (LIDeB), Department of Biological Sciences, Faculty of Exact Sciences, National University of La Plata (UNLP), La Plata B1900ADU, Argentina.

Publication information

J Chem Inf Model. 2022 Jun 27;62(12):2987-2998. doi: 10.1021/acs.jcim.2c00265. Epub 2022 Jun 10.

DOI:10.1021/acs.jcim.2c00265

PMID:35687523

Abstract

The clustering of small molecules implies the organization of a group of chemical structures into smaller subgroups with similar features. Clustering has important applications to sample chemical datasets or libraries in a representative manner (e.g., to choose, from a virtual screening hit list, a chemically diverse subset of compounds to be submitted to experimental confirmation, or to split datasets into representative training and validation sets when implementing machine learning models). Most strategies for clustering molecules are based on molecular fingerprints and hierarchical clustering algorithms. Here, two open-source in-house methodologies for clustering of small molecules are presented: iterative Random subspace Principal Component Analysis clustering (iRaPCA), an iterative approach based on feature bagging, dimensionality reduction, and K-means optimization; and Silhouette Optimized Molecular Clustering (SOMoC), which combines molecular fingerprints with the Uniform Manifold Approximation and Projection (UMAP) and Gaussian Mixture Model algorithm (GMM). In a benchmarking exercise, the performance of both clustering methods has been examined across 29 datasets containing between 100 and 5000 small molecules, comparing these results with those given by two other well-known clustering methods, Ward and Butina. iRaPCA and SOMoC consistently showed the best performance across these 29 datasets, both in terms of within-cluster and between-cluster distances. Both iRaPCA and SOMoC have been implemented as free Web Apps and standalone applications, to allow their use to a wide audience within the scientific community.
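The abstract compares methods by within-cluster and between-cluster distances but does not name the exact metrics. One common way to combine the two is the silhouette coefficient, which SOMoC's name also refers to. The block below is a minimal sketch of that idea, not the authors' implementation: it assumes fingerprints are represented as sets of "on" bit positions and uses Tanimoto distance, and the names `tanimoto_distance`, `mean_silhouette`, and `toy_fps` are hypothetical, introduced only for illustration.

```python
from typing import Dict, FrozenSet, List, Sequence

# Assumption: a fingerprint is modeled as the set of its "on" bit positions.
Fingerprint = FrozenSet[int]


def tanimoto_distance(a: Fingerprint, b: Fingerprint) -> float:
    """Return 1 - |a & b| / |a | b|; two empty fingerprints count as identical."""
    union = len(a | b)
    if union == 0:
        return 0.0
    return 1.0 - len(a & b) / union


def mean_silhouette(fps: Sequence[Fingerprint], labels: Sequence[int]) -> float:
    """Mean silhouette coefficient over all molecules, using Tanimoto distance."""
    clusters: Dict[int, List[int]] = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            # Singleton cluster: its silhouette is 0 by convention.
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(tanimoto_distance(fps[i], fps[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(tanimoto_distance(fps[i], fps[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        total += 0.0 if denom == 0 else (b - a) / denom
    return total / len(labels)


# Toy data (made up): two groups whose members share most of their bits.
toy_fps = [
    frozenset({1, 2, 3}),
    frozenset({1, 2, 4}),
    frozenset({10, 11, 12}),
    frozenset({10, 11, 13}),
]
good = mean_silhouette(toy_fps, [0, 0, 1, 1])  # groups similar molecules together
bad = mean_silhouette(toy_fps, [0, 1, 0, 1])   # splits each similar pair apart
print(good, bad)
```

On this toy input the assignment that keeps similar fingerprints together scores higher than the one that separates them, which is the kind of signal a silhouette-optimized method would use to choose among candidate clusterings.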

Similar articles

On the Best Way to Cluster NCI-60 Molecules.

Biomolecules. 2023 Mar 8;13(3):498. doi: 10.3390/biom13030498.

Deep clustering of small molecules at large-scale via variational autoencoder embedding and K-means.

BMC Bioinformatics. 2022 Apr 15;23(Suppl 4):132. doi: 10.1186/s12859-022-04667-1.

Statistical power for cluster analysis.

BMC Bioinformatics. 2022 May 31;23(1):205. doi: 10.1186/s12859-022-04675-1.

DGCyTOF: Deep learning with graphic cluster visualization to predict cell types of single cell mass cytometry data.

PLoS Comput Biol. 2022 Apr 11;18(4):e1008885. doi: 10.1371/journal.pcbi.1008885. eCollection 2022 Apr.

The application of Uniform Manifold Approximation and Projection (UMAP) for unconstrained ordination and classification of biological indicators in aquatic ecology.

Sci Total Environ. 2022 Apr 1;815:152365. doi: 10.1016/j.scitotenv.2021.152365. Epub 2021 Dec 25.

Accurate Molecular-Orbital-Based Machine Learning Energies via Unsupervised Clustering of Chemical Space.

J Chem Theory Comput. 2022 Aug 9;18(8):4826-4835. doi: 10.1021/acs.jctc.2c00396. Epub 2022 Jul 20.

Multi-view projected clustering with graph learning.

Neural Netw. 2020 Jun;126:335-346. doi: 10.1016/j.neunet.2020.03.020. Epub 2020 Apr 1.

Virtual screening by a new Clustering-based Weighted Similarity Extreme Learning Machine approach.

PLoS One. 2018 Apr 13;13(4):e0195478. doi: 10.1371/journal.pone.0195478. eCollection 2018.

Machine learning of COVID-19 clinical data identifies population structures with therapeutic potential.

iScience. 2022 Jul 15;25(7):104480. doi: 10.1016/j.isci.2022.104480. Epub 2022 May 31.

Cited by

Unraveling Protein-Metabolite Interactions in Precision Nutrition: A Case Study of Blueberry-Derived Metabolites Using Advanced Computational Methods.

Metabolites. 2024 Aug 3;14(8):430. doi: 10.3390/metabo14080430.

Democratizing cheminformatics: interpretable chemical grouping using an automated KNIME workflow.

J Cheminform. 2024 Aug 16;16(1):101. doi: 10.1186/s13321-024-00894-1.

Integrated virtual screening, molecular modeling and machine learning approaches revealed potential natural inhibitors for epilepsy.

Saudi Pharm J. 2023 Dec;31(12):101835. doi: 10.1016/j.jsps.2023.101835. Epub 2023 Oct 20.

Garbage in, garbage out: how reliable training data improved a virtual screening approach against SARS-CoV-2 MPro.

Front Pharmacol. 2023 Jun 22;14:1193282. doi: 10.3389/fphar.2023.1193282. eCollection 2023.

On the Best Way to Cluster NCI-60 Molecules.

Biomolecules. 2023 Mar 8;13(3):498. doi: 10.3390/biom13030498.

Identification of novel inhibitors for SARS-CoV-2 as therapeutic options using machine learning-based virtual screening, molecular docking and MD simulation.

Front Mol Biosci. 2023 Mar 7;10:1060076. doi: 10.3389/fmolb.2023.1060076. eCollection 2023.
