Vogt Martin, Bajorath Jürgen
Department of Life Science Informatics, B-IT, University of Bonn, Endenicher Allee 19c, Bonn, NRW, 53115, Germany.
F1000Res. 2020 Feb 10;9. doi: 10.12688/f1000research.22292.2. eCollection 2020.
The ccbmlib Python package is a collection of modules for modeling similarity value distributions based on Tanimoto coefficients for fingerprints available in RDKit. It can be used to assess the statistical significance of Tanimoto coefficients and evaluate how molecular similarity is reflected when different fingerprint representations are used. Significance measures derived from p-values allow a quantitative comparison of similarity scores obtained from different fingerprint representations that might have very different value ranges. Furthermore, the package models conditional distributions of similarity coefficients for a given reference compound. The conditional significance score estimates where a test compound would be ranked in a similarity search. The models are based on the statistical analysis of feature distributions and feature correlations of fingerprints of a reference database. The resulting models have been evaluated for 11 RDKit fingerprints, taking a collection of ChEMBL compounds as a reference data set. For most fingerprints, highly accurate models were obtained, with differences of 1% or less for Tanimoto coefficients indicating high similarity.
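The idea of attaching a p-value to a Tanimoto coefficient can be illustrated with a minimal sketch. It does not use the ccbmlib API, and it differs from the package's approach: ccbmlib models the similarity distribution from feature statistics of a reference database, whereas this sketch simply counts values in a sample of background similarities. The function names `tanimoto` and `empirical_p_value`, the representation of fingerprints as sets of on-bit indices, and the toy numbers are all assumptions made for illustration.

```python
from bisect import bisect_left


def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    # Two empty fingerprints share no features; return 0.0 by convention here.
    return len(a & b) / union if union else 0.0


def empirical_p_value(tc, background):
    """Fraction of background Tanimoto values that are >= tc.

    `background` is a hypothetical sample of similarity values computed
    between random compound pairs for ONE fingerprint type.
    """
    values = sorted(background)
    if not values:
        raise ValueError("background sample must not be empty")
    # bisect_left returns the number of values strictly smaller than tc.
    n_smaller = bisect_left(values, tc)
    return (len(values) - n_smaller) / len(values)


# Toy data: two fingerprints and a tiny background sample (illustrative only).
tc = tanimoto({1, 2, 3, 4}, {3, 4, 5, 6})           # 2 shared bits / 6 total bits
p = empirical_p_value(0.4, [0.1, 0.2, 0.3, 0.4, 0.5])  # 2 of 5 values are >= 0.4
```

Because a p-value is expressed relative to the background distribution of its own fingerprint, p-values for two different fingerprints are on a common scale even when their raw Tanimoto ranges differ, which is the motivation for the comparison described above.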