Dablander Markus, Hanser Thierry, Lambiotte Renaud, Morris Garrett M
Mathematical Institute, University of Oxford, Andrew Wiles Building, Radcliffe Observatory Quarter (550), Woodstock Road, Oxford, OX2 6GG, UK.
Lhasa Limited, Granary Wharf House, 2 Canal Wharf, Leeds, LS11 5PS, UK.
J Cheminform. 2024 Dec 3;16(1):135. doi: 10.1186/s13321-024-00932-y.
Extended-connectivity fingerprints (ECFPs) are a ubiquitous tool in current cheminformatics and molecular machine learning, and one of the most prevalent molecular feature extraction techniques used for chemical prediction. Atom features learned by graph neural networks can be aggregated to compound-level representations using a large spectrum of graph pooling methods. In contrast, sets of detected ECFP substructures are by default transformed into bit vectors using only a simple hash-based folding procedure. We introduce a general mathematical framework for the vectorisation of structural fingerprints via a formal operation called substructure pooling that encompasses hash-based folding, algorithmic substructure selection, and a wide variety of other potential techniques. We go on to describe Sort & Slice, an easy-to-implement and bit-collision-free alternative to hash-based folding for the pooling of ECFP substructures. Sort & Slice first sorts ECFP substructures according to their relative prevalence in a given set of training compounds and then slices away all but the L most frequent substructures, which are subsequently used to generate a binary fingerprint of desired length, L. We computationally compare the performance of hash-based folding, Sort & Slice, and two advanced supervised substructure-selection schemes (filtering and mutual-information maximisation) for ECFP-based molecular property prediction. Our results indicate that, despite its technical simplicity, Sort & Slice robustly (and at times substantially) outperforms traditional hash-based folding as well as the other investigated substructure-pooling methods across distinct prediction tasks, data splitting techniques, machine-learning models and ECFP hyperparameters. We thus recommend that Sort & Slice canonically replace hash-based folding as the default substructure-pooling technique to vectorise ECFPs for supervised molecular machine learning.
Scientific contribution: A general mathematical framework for the vectorisation of structural fingerprints called substructure pooling; and the technical description and computational evaluation of Sort & Slice, a conceptually simple and bit-collision-free method for the pooling of ECFP substructures that robustly and markedly outperforms classical hash-based folding at molecular property prediction.
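The Sort & Slice procedure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes each compound is already given as a set of integer ECFP substructure identifiers; the function names (`fit_sort_and_slice`, `sort_and_slice_vector`, `fold_vector`), the toy identifiers, and the tie-breaking rule are hypothetical choices made for illustration. Hash-based folding is modelled here simply as the identifier modulo the fingerprint length, which is a simplifying assumption.

```python
from collections import Counter
from typing import Iterable, List, Sequence, Set


def fit_sort_and_slice(train_fps: Iterable[Set[int]], length: int) -> List[int]:
    """Return the `length` substructure IDs found in the most training compounds."""
    counts: Counter = Counter()
    for substructures in train_fps:
        # Each compound contributes at most one count per substructure.
        counts.update(set(substructures))
    # Sort by descending prevalence; ties are broken by ascending ID
    # (an arbitrary assumption, the abstract does not specify tie-breaking).
    ranked = sorted(counts, key=lambda s: (-counts[s], s))
    # Slice away everything except the `length` most frequent substructures.
    return ranked[:length]


def sort_and_slice_vector(
    substructures: Set[int], vocabulary: Sequence[int], length: int
) -> List[int]:
    """Binary vector with one dedicated position per retained substructure."""
    position = {s: i for i, s in enumerate(vocabulary)}
    vector = [0] * length
    for s in substructures:
        i = position.get(s)
        if i is not None:  # substructures outside the vocabulary are dropped
            vector[i] = 1
    return vector


def fold_vector(substructures: Set[int], length: int) -> List[int]:
    """Simplified hash-based folding: distinct IDs may share a position."""
    vector = [0] * length
    for s in substructures:
        vector[s % length] = 1
    return vector


# Toy training set: each compound is a set of made-up substructure IDs.
train = [{10, 14, 22}, {10, 14, 37}, {10, 22, 41}, {10, 18}]
L = 4

vocabulary = fit_sort_and_slice(train, L)
sliced = sort_and_slice_vector({10, 14, 22}, vocabulary, L)
folded = fold_vector({10, 14, 22}, L)

print(vocabulary)  # [10, 14, 22, 18]
print(sliced)      # [1, 1, 1, 0]
print(folded)      # [0, 0, 1, 0]
```

In this toy case, folding maps the identifiers 10, 14 and 22 to the same position, so three distinct substructures collapse into one bit. The sorted and sliced vector keeps them apart because each retained substructure has its own position. The vocabulary is fitted on training compounds only; substructures that appear only in unseen compounds are ignored when vectorising.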