School of Information and Computer Sciences, Institute for Genomics and Bioinformatics, University of California, Irvine, Irvine, California 92697-3435, USA.
J Chem Inf Model. 2010 Jul 26;50(7):1205-22. doi: 10.1021/ci100010v.
As repositories of chemical molecules continue to expand and become more open, it becomes increasingly important to develop tools to search them efficiently and to assess the statistical significance of chemical similarity scores. Here, we develop a general framework for understanding, modeling, predicting, and approximating the distribution of chemical similarity scores and its extreme values in large databases. The framework can be applied to different chemical representations and similarity measures but is demonstrated here using the most common binary fingerprints with the Tanimoto similarity measure. After introducing several probabilistic models of fingerprints, including the Conditional Gaussian Uniform model, we show that the distribution of Tanimoto scores can be approximated by the distribution of the ratio of two correlated Normal random variables associated with the corresponding unions and intersections. This also remains true when the distribution of similarity scores is conditioned on the size of the query molecules to derive more fine-grained results and improve chemical retrieval. The corresponding extreme value distributions for the maximum scores are approximated by Weibull distributions. From these various distributions and their analytical forms, Z-scores, E-values, and p-values are derived to assess the significance of similarity scores. In addition, the framework allows one to predict the values of standard chemical retrieval metrics, such as sensitivity and specificity at fixed thresholds, or receiver operating characteristic (ROC) curves at multiple thresholds, and to detect outliers in the form of atypical molecules. Numerous and diverse experiments, performed in part with large sets of molecules from ChemDB, show remarkable agreement between theory and empirical results.
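To make the quantities named in the abstract concrete, the following is a minimal sketch of the Tanimoto similarity on binary fingerprints, together with a Z-score, p-value, and E-value for a query score. It is not the paper's method: the paper derives these statistics from analytical models (a ratio of correlated Normal variables, Weibull extreme value distributions), whereas this sketch simply estimates them empirically from a background sample of scores. The fingerprint length, bit density, database size, and all function names are hypothetical choices made for illustration, and the E-value is taken here as the p-value times the database size.

```python
from __future__ import annotations

import random
from statistics import mean, pstdev


def tanimoto(a: set[int], b: set[int]) -> float:
    """Tanimoto similarity |A & B| / |A | B| of two fingerprints,
    each given as the set of indices of its on bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def random_fingerprint(n_bits: int, density: float, rng: random.Random) -> set[int]:
    """Hypothetical fingerprint generator: each bit is on independently
    with probability `density` (an illustrative assumption)."""
    return {i for i in range(n_bits) if rng.random() < density}


def significance(score: float, background: list[float]) -> tuple[float, float, float]:
    """Empirical Z-score, p-value, and E-value of `score` against a
    background sample of similarity scores (an empirical stand-in for
    the analytical distributions described in the abstract)."""
    mu = mean(background)
    sigma = pstdev(background)
    z = (score - mu) / sigma if sigma > 0 else 0.0
    # Fraction of background scores at least as large as the observed one.
    p = sum(s >= score for s in background) / len(background)
    # Expected number of background hits at or above this score.
    e = p * len(background)
    return z, p, e


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    query = random_fingerprint(512, 0.1, rng)
    database = [random_fingerprint(512, 0.1, rng) for _ in range(200)]
    background = [tanimoto(query, mol) for mol in database]
    # Score the query against a near-duplicate of itself (one bit removed).
    candidate = set(query)
    if candidate:
        candidate.pop()
    z, p, e = significance(tanimoto(query, candidate), background)
    print(f"Z-score={z:.2f}  p-value={p:.4f}  E-value={e:.1f}")
```

An empirical p-value of this kind cannot resolve probabilities smaller than one over the background sample size, which is one motivation for the analytical approximations the abstract describes.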