When is chemical similarity significant? The statistical distribution of chemical similarity scores and its extreme values.

Affiliation

School of Information and Computer Sciences, Institute for Genomics and Bioinformatics, University of California, Irvine, Irvine, California 92697-3435, USA.

Publication information

J Chem Inf Model. 2010 Jul 26;50(7):1205-22. doi: 10.1021/ci100010v.

Abstract

As repositories of chemical molecules continue to expand and become more open, it becomes increasingly important to develop tools to search them efficiently and assess the statistical significance of chemical similarity scores. Here, we develop a general framework for understanding, modeling, predicting, and approximating the distribution of chemical similarity scores and its extreme values in large databases. The framework can be applied to different chemical representations and similarity measures but is demonstrated here using the most common binary fingerprints with the Tanimoto similarity measure. After introducing several probabilistic models of fingerprints, including the Conditional Gaussian Uniform model, we show that the distribution of Tanimoto scores can be approximated by the distribution of the ratio of two correlated Normal random variables associated with the corresponding unions and intersections. This remains true also when the distribution of similarity scores is conditioned on the size of the query molecules to derive more fine-grained results and improve chemical retrieval. The corresponding extreme value distributions for the maximum scores are approximated by Weibull distributions. From these various distributions and their analytical forms, Z-scores, E-values, and p-values are derived to assess the significance of similarity scores. In addition, the framework also allows one to predict the value of standard chemical retrieval metrics, such as sensitivity and specificity at fixed thresholds, or receiver operating characteristic (ROC) curves at multiple thresholds, and to detect outliers in the form of atypical molecules. Numerous and diverse experiments that have been performed, in part with large sets of molecules from the ChemDB, show remarkable agreement between theory and empirical results.
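
As a rough illustration of the quantities named in the abstract, the Python sketch below computes Tanimoto scores between a query fingerprint and a database of binary fingerprints, then assigns an empirical Z-score and p-value to one candidate score. It is not the paper's method: the paper derives analytical forms (a ratio of two correlated Normal variables, and Weibull approximations for the maximum scores), whereas this sketch estimates everything directly from a sample. The fingerprint length (512 bits), bit density (0.1), database size, independent-bit fingerprint generator, and all function names are assumptions made only for illustration.

```python
import random
import statistics


def tanimoto(a: int, b: int) -> float:
    """Tanimoto similarity of two binary fingerprints stored as int bitmasks."""
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return bin(a & b).count("1") / union


def random_fingerprint(n_bits: int, density: float, rng: random.Random) -> int:
    """Hypothetical fingerprint model: each bit is set independently."""
    fp = 0
    for i in range(n_bits):
        if rng.random() < density:
            fp |= 1 << i
    return fp


def z_score(score: float, background: list) -> float:
    """Standardize a score against the background score distribution."""
    mu = statistics.mean(background)
    sigma = statistics.pstdev(background)
    if sigma == 0:
        raise ValueError("background scores have zero variance")
    return (score - mu) / sigma


def empirical_p_value(score: float, background: list) -> float:
    """Fraction of background scores at least as large (add-one smoothed)."""
    hits = sum(1 for s in background if s >= score)
    return (hits + 1) / (len(background) + 1)


if __name__ == "__main__":
    rng = random.Random(0)
    # Assumed toy database of random, unrelated fingerprints.
    database = [random_fingerprint(512, 0.1, rng) for _ in range(2000)]
    query = random_fingerprint(512, 0.1, rng)
    # Assumed "analog": the query with a small random set of bits flipped.
    analog = query ^ random_fingerprint(512, 0.02, rng)

    background = [tanimoto(query, fp) for fp in database]
    score = tanimoto(query, analog)
    print("score:", round(score, 3))
    print("Z-score:", round(z_score(score, background), 2))
    print("empirical p-value:", empirical_p_value(score, background))
```

An empirical p-value of this kind cannot go below 1/(N+1) for a background of N scores, which is one reason analytical approximations of the score distribution and its extreme values are useful for very large databases.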
