Sort & Slice: a simple and superior alternative to hash-based folding for extended-connectivity fingerprints.

Authors

Dablander Markus, Hanser Thierry, Lambiotte Renaud, Morris Garrett M

Affiliations

Mathematical Institute, University of Oxford, Andrew Wiles Building, Radcliffe Observatory Quarter (550), Woodstock Road, Oxford, OX2 6GG, UK.

Lhasa Limited, Granary Wharf House, 2 Canal Wharf, Leeds, LS11 5PS, UK.

Publication information

J Cheminform. 2024 Dec 3;16(1):135. doi: 10.1186/s13321-024-00932-y.

DOI:10.1186/s13321-024-00932-y
PMID:39627861
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11616156/
Abstract

Extended-connectivity fingerprints (ECFPs) are a ubiquitous tool in current cheminformatics and molecular machine learning, and one of the most prevalent molecular feature extraction techniques used for chemical prediction. Atom features learned by graph neural networks can be aggregated to compound-level representations using a large spectrum of graph pooling methods. In contrast, sets of detected ECFP substructures are by default transformed into bit vectors using only a simple hash-based folding procedure. We introduce a general mathematical framework for the vectorisation of structural fingerprints via a formal operation called substructure pooling that encompasses hash-based folding, algorithmic substructure selection, and a wide variety of other potential techniques. We go on to describe Sort & Slice, an easy-to-implement and bit-collision-free alternative to hash-based folding for the pooling of ECFP substructures. Sort & Slice first sorts ECFP substructures according to their relative prevalence in a given set of training compounds and then slices away all but the L most frequent substructures which are subsequently used to generate a binary fingerprint of desired length, L. We computationally compare the performance of hash-based folding, Sort & Slice, and two advanced supervised substructure-selection schemes (filtering and mutual-information maximisation) for ECFP-based molecular property prediction. Our results indicate that, despite its technical simplicity, Sort & Slice robustly (and at times substantially) outperforms traditional hash-based folding as well as the other investigated substructure-pooling methods across distinct prediction tasks, data splitting techniques, machine-learning models and ECFP hyperparameters. We thus recommend that Sort & Slice canonically replace hash-based folding as the default substructure-pooling technique to vectorise ECFPs for supervised molecular machine learning. 
Scientific contribution: A general mathematical framework for the vectorisation of structural fingerprints called substructure pooling; and the technical description and computational evaluation of Sort & Slice, a conceptually simple and bit-collision-free method for the pooling of ECFP substructures that robustly and markedly outperforms classical hash-based folding at molecular property prediction.
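
The two pooling schemes contrasted in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes each compound is already represented as a set of integer ECFP substructure identifiers, uses a plain modulo as a stand-in for the folding hash, and breaks frequency ties by identifier; all three choices, and the function names, are assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable, List, Set


def hash_fold(substructure_ids: Set[int], length: int) -> List[int]:
    """Fold identifiers into `length` bits; modulo stands in for the hash (assumption)."""
    bits = [0] * length
    for sub_id in substructure_ids:
        bits[sub_id % length] = 1  # two different identifiers can set the same bit
    return bits


def fit_sort_and_slice(train_sets: Iterable[Set[int]], length: int) -> List[int]:
    """Return the `length` identifiers that occur in the most training compounds."""
    counts: Counter = Counter()
    for ids in train_sets:
        counts.update(ids)  # a set contributes each identifier at most once
    # Sort by descending prevalence; ties are broken by identifier (assumed rule).
    ranked = sorted(counts, key=lambda sub_id: (-counts[sub_id], sub_id))
    return ranked[:length]  # slice away everything but the top `length`


def sort_and_slice(substructure_ids: Set[int], vocabulary: List[int], length: int) -> List[int]:
    """One bit per retained substructure, zero-padded if the vocabulary is short."""
    bits = [1 if sub_id in substructure_ids else 0 for sub_id in vocabulary]
    return bits + [0] * (length - len(bits))


# Toy training set: three compounds given as sets of made-up identifiers.
train = [{11, 27, 35}, {11, 27}, {11, 43}]
vocabulary = fit_sort_and_slice(train, length=3)
print(vocabulary)                                   # [11, 27, 35]
print(sort_and_slice({27, 43, 99}, vocabulary, 3))  # [0, 1, 0]
print(hash_fold({3, 11}, 8))                        # [0, 0, 0, 1, 0, 0, 0, 0]
```

In the toy output, folding maps identifiers 3 and 11 onto the same bit, whereas every Sort & Slice position corresponds to exactly one substructure. Substructures that are rare or absent in the training set (43 and 99 here) are simply dropped.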

Fig. 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e13d/11616156/3608be6d0cd2/13321_2024_932_Fig1_HTML.jpg
Fig. 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e13d/11616156/2d3d0d4edad4/13321_2024_932_Fig2_HTML.jpg
Fig. 3: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e13d/11616156/00310d9ca142/13321_2024_932_Fig3_HTML.jpg
Fig. 4: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e13d/11616156/bb34dbbb7ae8/13321_2024_932_Fig4_HTML.jpg
Fig. 5: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e13d/11616156/0f0726534b50/13321_2024_932_Fig5_HTML.jpg
Fig. 6: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e13d/11616156/dbb1882a3aa0/13321_2024_932_Fig6_HTML.jpg
Fig. 7: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e13d/11616156/d8217be7de0a/13321_2024_932_Fig7_HTML.jpg
