When is chemical similarity significant? The statistical distribution of chemical similarity scores and its extreme values.

Affiliation

School of Information and Computer Sciences, Institute for Genomics and Bioinformatics, University of California, Irvine, Irvine, California 92697-3435, USA.

Publication information

J Chem Inf Model. 2010 Jul 26;50(7):1205-22. doi: 10.1021/ci100010v.

DOI:10.1021/ci100010v

PMID:20540577

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC2914517/

Abstract

As repositories of chemical molecules continue to expand and become more open, it becomes increasingly important to develop tools to search them efficiently and assess the statistical significance of chemical similarity scores. Here, we develop a general framework for understanding, modeling, predicting, and approximating the distribution of chemical similarity scores and its extreme values in large databases. The framework can be applied to different chemical representations and similarity measures but is demonstrated here using the most common binary fingerprints with the Tanimoto similarity measure. After introducing several probabilistic models of fingerprints, including the Conditional Gaussian Uniform model, we show that the distribution of Tanimoto scores can be approximated by the distribution of the ratio of two correlated Normal random variables associated with the corresponding unions and intersections. This also holds when the distribution of similarity scores is conditioned on the size of the query molecules to derive more fine-grained results and improve chemical retrieval. The corresponding extreme value distributions for the maximum scores are approximated by Weibull distributions. From these various distributions and their analytical forms, Z-scores, E-values, and p-values are derived to assess the significance of similarity scores. In addition, the framework allows one to predict the value of standard chemical retrieval metrics, such as sensitivity and specificity at fixed thresholds, or receiver operating characteristic (ROC) curves at multiple thresholds, and to detect outliers in the form of atypical molecules. Numerous and diverse experiments, performed in part with large sets of molecules from ChemDB, show remarkable agreement between theory and empirical results.
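To make the quantities named in the abstract concrete, the following is a minimal sketch of how a Tanimoto score can be turned into a Z-score, a p-value, and an E-value. It does not implement the paper's analytical models (the ratio of correlated Normal variables or the Weibull approximation). Instead, it assumes a purely empirical background: fingerprints are simulated with independent bits, and significance is read off the observed score distribution. The function names, the fingerprint length of 512, the bit density of 0.1, and the database size of 2000 are all illustrative assumptions.

```python
import random
import statistics


def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def random_fingerprint(n_bits: int, p_on: float, rng: random.Random) -> frozenset:
    """Simulated fingerprint: each bit is on independently with probability p_on.

    This independence is an illustrative assumption, not the paper's model.
    """
    return frozenset(i for i in range(n_bits) if rng.random() < p_on)


def significance(score: float, background: list, db_size: int):
    """Empirical Z-score, p-value, and E-value of a score against background scores."""
    mean = statistics.fmean(background)
    sd = statistics.stdev(background)
    z = (score - mean) / sd
    # Empirical p-value with an add-one correction so it is never exactly zero.
    p = (1 + sum(s >= score for s in background)) / (1 + len(background))
    # E-value: expected number of database entries scoring this high by chance.
    e = p * db_size
    return z, p, e


rng = random.Random(0)
query = random_fingerprint(512, 0.1, rng)
database = [random_fingerprint(512, 0.1, rng) for _ in range(2000)]
background = [tanimoto(query, fp) for fp in database]

# A near-duplicate of the query: the same on-bits with three of them removed.
hit = frozenset(sorted(query)[:-3])
score = tanimoto(query, hit)
z, p, e = significance(score, background, len(database))
print(f"score={score:.3f}  z={z:.1f}  p={p:.4g}  E={e:.4g}")
```

An empirical p-value of this kind cannot resolve probabilities smaller than about one over the number of background scores, which is one reason analytical approximations of the score distribution and its extreme values are useful for large databases.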
