Efficient clustering of large molecular libraries.

Authors

Pérez Kenneth López, Jung Vicky, Chen Lexin, Huddleston Kate, Miranda-Quintana Ramón Alain

Affiliations

Department of Chemistry & Quantum Theory Project, University of Florida, Gainesville, Florida 32611.

Publication information

bioRxiv. 2024 Aug 10:2024.08.10.607459. doi: 10.1101/2024.08.10.607459.

DOI: 10.1101/2024.08.10.607459

PMID: 39149242

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11326248/

Abstract

The widespread use of Machine Learning (ML) techniques in chemical applications has come with the pressing need to analyze extremely large molecular libraries. In particular, clustering remains one of the most common tools to dissect the chemical space. Unfortunately, most current approaches present unfavorable time and memory scaling, which makes them unsuitable to handle million- and billion-sized sets. Here, we propose to bypass these problems with a time- and memory-efficient clustering algorithm, BitBIRCH. This method uses a tree structure similar to the one found in the Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) algorithm to ensure time scaling. BitBIRCH leverages the instant similarity (iSIM) formalism to process binary fingerprints, allowing the use of Tanimoto similarity, and reducing memory requirements. Our tests show that BitBIRCH is already > 1,000 times faster than standard implementations of the Taylor-Butina clustering for libraries with 1,500,000 molecules. BitBIRCH increases efficiency without compromising the quality of the resulting clusters. We explore strategies to handle large sets, which we applied in the clustering of one billion molecules under 5 hours using a parallel/iterative BitBIRCH approximation.
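The memory argument behind processing binary fingerprints with iSIM can be illustrated with a small sketch. This is not the BitBIRCH implementation. It assumes that a set of N fingerprints is summarized only by its per-bit column sums, and that the set-level Tanimoto value is the pooled ratio of shared "on" bits to shared-plus-mismatched bits over all pairs. The function names (`column_sums`, `isim_tanimoto`, `pairwise_pooled_tanimoto`) and the toy fingerprints are hypothetical and chosen for illustration only.

```python
from itertools import combinations


def column_sums(fingerprints):
    """Count, for every bit position, how many fingerprints have that bit on."""
    return [sum(column) for column in zip(*fingerprints)]


def isim_tanimoto(col_sums, n):
    """Set-level Tanimoto computed from column sums only (linear in the bit count).

    A bit that is on in k of n fingerprints is shared by k*(k-1)/2 pairs
    and is a mismatch in k*(n-k) pairs.
    """
    shared = sum(k * (k - 1) // 2 for k in col_sums)
    mismatched = sum(k * (n - k) for k in col_sums)
    total = shared + mismatched
    return shared / total if total else 0.0


def pairwise_pooled_tanimoto(fingerprints):
    """Quadratic reference: pooled intersection over pooled union across all pairs."""
    inter = union = 0
    for a, b in combinations(fingerprints, 2):
        inter += sum(x & y for x, y in zip(a, b))
        union += sum(x | y for x, y in zip(a, b))
    return inter / union if union else 0.0


# Toy 6-bit fingerprints (made-up data, not from the paper).
fps = [
    [1, 1, 0, 0, 1, 0],
    [1, 0, 0, 1, 1, 0],
    [1, 1, 1, 0, 0, 0],
    [0, 1, 0, 0, 1, 1],
]
sums = column_sums(fps)
fast = isim_tanimoto(sums, len(fps))
slow = pairwise_pooled_tanimoto(fps)
print(fast, slow)
```

The two values agree because the column sums carry the same pair counts that the explicit loop accumulates. Under this assumption, a cluster can be represented by its size and one vector of counts, and adding a molecule only updates that vector, so no pairwise similarity matrix needs to be stored. Note that the pooled ratio is not the arithmetic mean of the individual pairwise Tanimoto values.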

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2368/11326248/f8bbb8a17ecd/nihpp-2024.08.10.607459v1-f0001.jpg
