Kenneth López Pérez, Vicky Jung, Lexin Chen, Kate Huddleston, Ramón Alain Miranda-Quintana
Department of Chemistry & Quantum Theory Project, University of Florida, Gainesville, Florida 32611.
bioRxiv. 2024 Aug 10:2024.08.10.607459. doi: 10.1101/2024.08.10.607459.
The widespread use of Machine Learning (ML) techniques in chemical applications has created a pressing need to analyze extremely large molecular libraries. In particular, clustering remains one of the most common tools for dissecting chemical space. Unfortunately, most current approaches scale unfavorably in time and memory, which makes them unsuitable for million- and billion-sized sets. Here, we propose to bypass these problems with a time- and memory-efficient clustering algorithm, BitBIRCH. This method uses a tree structure similar to the one found in the Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) algorithm to ensure O(N) time scaling. BitBIRCH leverages the instant similarity (iSIM) formalism to process binary fingerprints, which allows the use of Tanimoto similarity and reduces memory requirements. Our tests show that BitBIRCH is already more than 1,000 times faster than standard implementations of Taylor-Butina clustering for libraries with 1,500,000 molecules. BitBIRCH increases efficiency without compromising the quality of the resulting clusters. We also explore strategies for handling large sets, which we applied to cluster one billion molecules in under 5 hours using a parallel/iterative BitBIRCH approximation.
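To illustrate why an iSIM-style summary keeps memory low, the sketch below computes a ratio-of-sums Tanimoto value for a whole set of binary fingerprints from only the per-bit on-counts and the set size, without forming the pairwise similarity matrix. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `isim_tanimoto` and `would_merge`, the toy fingerprints, and the threshold-based merge test are all hypothetical, and the abstract does not specify BitBIRCH's actual merge criterion.

```python
import numpy as np


def isim_tanimoto(col_sums, n):
    """Ratio-of-sums Tanimoto over all pairs in a set of n binary fingerprints.

    col_sums[i] is the number of fingerprints with bit i switched on.
    Only this vector and n are needed, so the cost is linear in the
    number of bits and independent of the number of pairs.
    """
    k = np.asarray(col_sums, dtype=np.int64)
    both_on = (k * (k - 1) // 2).sum()   # pairs sharing an on-bit, summed over bits
    mismatch = (k * (n - k)).sum()       # pairs where exactly one has the bit on
    denom = both_on + mismatch
    return float(both_on / denom) if denom else 0.0


def would_merge(cs_a, n_a, cs_b, n_b, threshold):
    """Hypothetical BIRCH-like test: merge two summaries if the combined set
    stays above a similarity threshold. Summaries add component-wise."""
    merged = np.asarray(cs_a, dtype=np.int64) + np.asarray(cs_b, dtype=np.int64)
    return isim_tanimoto(merged, n_a + n_b) >= threshold


# Toy data (assumption): three 4-bit fingerprints.
fps = np.array(
    [[1, 1, 0, 0],
     [1, 1, 1, 0],
     [1, 0, 1, 0]],
    dtype=np.uint8,
)
col_sums = fps.sum(axis=0)
value = isim_tanimoto(col_sums, len(fps))
pair_value = isim_tanimoto(fps[0].astype(np.int64) + fps[2], 2)
merge_ok = would_merge(fps[:2].sum(axis=0), 2, fps[2], 1, threshold=0.5)
print(value, pair_value, merge_ok)
```

Because each summary is just a count vector plus a size, two summaries can be combined by addition, which is the property a BIRCH-like tree needs to update its nodes incrementally. For a set of exactly two fingerprints the value reduces to the ordinary pairwise Tanimoto similarity.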