Pangea Bio, Pangea Botanica GmbH, Hardenbergstrasse 32, 10623 Berlin, Germany.
Centre for Molecular Informatics, Yusuf Hamied Department of Chemistry, University of Cambridge, Lensfield Rd, CB2 1EW Cambridge, United Kingdom.
J Chem Inf Model. 2024 Oct 28;64(20):7905-7916. doi: 10.1021/acs.jcim.4c01314. Epub 2024 Oct 15.
Organizing and partitioning sets of chemical structures is of considerable practical significance, e.g., in compound library analysis and the postprocessing of screening hit lists. Approaches such as unsupervised clustering are computationally demanding and dataset-dependent; on the other hand, rule-based methods, such as those based on Murcko scaffolds, have linear time complexity but are often too fine-grained, leading to a large number of singletons or sparsely populated classes. An alternative rule-based method that seeks to achieve an optimal balance when grouping compounds into sets is the 'Scaffold Identification and Naming System' (SCINS). To facilitate public use of this previously published method, here, we provide an open-source Python implementation of SCINS, dependent only on RDKit. We show that SCINS can be useful in identifying sparsely and densely populated regions in chemical space in large databases, here exemplified with Enamine REAL Diverse and ChEMBL. We find that Enamine REAL Diverse covers a much smaller SCINS space relative to ChEMBL, whereas the opposite is true when Murcko and generic Murcko scaffolds are considered. Additionally, we show that SCINS can result in chemically intuitive grouping of medium-sized sets of bioactive compounds, which can be useful in compound selection from virtual screening campaigns as well as postprocessing of experimental hit lists. Hence, in this work, we provide both an open-source implementation of SCINS and its characterization with relevant use cases.
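The contrast the abstract draws between rule-based grouping and clustering can be illustrated with a minimal sketch: each compound is mapped to a key in a single pass, so the cost is linear in the number of compounds, and the fraction of singleton classes shows how fine-grained the key is. This is not the SCINS implementation described in the paper. The helper names `group_by_key`, `singleton_fraction`, `murcko_key`, and `generic_murcko_key` are hypothetical, and the two key functions assume RDKit is installed (it is imported lazily, so the grouping helpers work without it).

```python
from collections import defaultdict


def group_by_key(items, key_fn):
    """Group items by a rule-based key in one pass (linear in len(items))."""
    groups = defaultdict(list)
    for item in items:
        groups[key_fn(item)].append(item)
    return dict(groups)


def singleton_fraction(groups):
    """Fraction of classes that contain exactly one member."""
    if not groups:
        return 0.0
    singletons = sum(1 for members in groups.values() if len(members) == 1)
    return singletons / len(groups)


def murcko_key(smiles):
    """Murcko scaffold SMILES for one compound (assumes RDKit is available)."""
    from rdkit import Chem
    from rdkit.Chem.Scaffolds import MurckoScaffold

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:  # unparsable input gets its own "None" class
        return None
    return MurckoScaffold.MurckoScaffoldSmiles(mol=mol)


def generic_murcko_key(smiles):
    """Generic Murcko scaffold: all atoms become carbon, all bonds single."""
    from rdkit import Chem
    from rdkit.Chem.Scaffolds import MurckoScaffold

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    scaffold = MurckoScaffold.GetScaffoldForMol(mol)
    return Chem.MolToSmiles(MurckoScaffold.MakeScaffoldGeneric(scaffold))
```

With RDKit installed, `group_by_key(smiles_list, murcko_key)` followed by `singleton_fraction(...)` gives a rough measure of how sparsely populated the Murcko classes are for a given set; swapping in `generic_murcko_key` coarsens the grouping. A SCINS key function could be passed in the same way, but its construction is defined in the paper's implementation and is not reproduced here.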