Sánchez-Cruz Norberto, Medina-Franco José L
Department of Pharmacy, School of Chemistry, Universidad Nacional Autónoma de México, Avenida Universidad 3000, 04510, Mexico City, Mexico.
J Cheminform. 2018 Nov 22;10(1):55. doi: 10.1186/s13321-018-0311-x.
Simplified representations of compound databases have several applications in cheminformatics. Herein, we introduce an alternative, general method for building single-fingerprint representations of compound databases. The approach is inspired by the previously published modal fingerprints, which aim to capture the most significant bits of a fingerprint representation for a compound data set. The novelty of the proposed statistical-based database fingerprint (SB-DFP) is that it is generated from comparisons of binomial proportions, taking as reference the distribution of "1" bits in a large set representative of chemical space.
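The construction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a pooled one-sided two-proportion z-test, a critical value of 2.326 (roughly a one-sided 1% level), and the hypothetical names `sb_dfp`, `ref_bit_freq`, and `ref_size`. None of these details is stated in the abstract.

```python
import math


def sb_dfp(dataset_fps, ref_bit_freq, ref_size, z_crit=2.326):
    """Build one database fingerprint from many compound fingerprints.

    dataset_fps  : list of equal-length 0/1 lists, one per compound
    ref_bit_freq : per-bit fraction of "1" bits in the reference set
    ref_size     : number of compounds in the reference set
    z_crit       : assumed one-sided critical value (hypothetical choice)
    """
    n = len(dataset_fps)
    fingerprint = []
    for j, p_ref in enumerate(ref_bit_freq):
        ones = sum(fp[j] for fp in dataset_fps)
        p_set = ones / n
        # Pooled proportion under the null hypothesis p_set == p_ref.
        pooled = (ones + p_ref * ref_size) / (n + ref_size)
        se = math.sqrt(pooled * (1 - pooled) * (1 / n + 1 / ref_size))
        z = (p_set - p_ref) / se if se > 0 else 0.0
        # Keep the bit only if it is significantly enriched in the data set.
        fingerprint.append(1 if z > z_crit else 0)
    return fingerprint


# Toy example: 4 compounds, 3 bits, reference of one million compounds.
toy_set = [[1, 0, 1], [1, 0, 0], [1, 0, 1], [1, 0, 0]]
toy_fp = sb_dfp(toy_set, ref_bit_freq=[0.1, 0.5, 0.5], ref_size=1_000_000)
```

In the toy example only the first bit is much more frequent in the data set than in the reference, so it is the only bit that would be switched on.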
To illustrate the method, SB-DFPs were constructed for 28 epigenetic target data sets retrieved from a recently published epigenomics database of interest in probe and drug discovery. For each target data set, SB-DFPs were built from two representative fingerprints of different design, using as reference a data set of more than 15 million compounds from ZINC. The application of SB-DFP was illustrated, and compared with other methods, through association relationships among the 28 epigenetic data sets and through similarity searching. Overall, SB-DFPs captured both the features common to the data sets and the features distinctive of each set. In similarity searching, SB-DFP equaled or outperformed other approaches for at least 20 of the 28 sets.
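As a rough illustration of how a single database fingerprint could serve as a similarity-search query, the sketch below ranks library compounds against it. The abstract does not name the similarity measure; the Tanimoto coefficient is assumed here, and `tanimoto`, `rank_by_similarity`, and the example library are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto coefficient between two equal-length 0/1 fingerprints."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0


def rank_by_similarity(query_fp, library):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, tanimoto(query_fp, fp)) for name, fp in library.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical database fingerprint and a three-compound library.
database_fp = [1, 1, 0, 0]
library = {
    "cpd_a": [1, 1, 0, 0],
    "cpd_b": [1, 0, 0, 0],
    "cpd_c": [0, 0, 1, 1],
}
ranking = rank_by_similarity(database_fp, library)
```

Compounds sharing more "1" bits with the database fingerprint rank higher, which is the sense in which one fingerprint can stand in for a whole target data set.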
SB-DFP is a general approach, based on comparisons of binomial proportions, for representing a compound data set with a single fingerprint. SB-DFP can be developed, at least in principle, from any fingerprint and any reference data set. SB-DFP is a good alternative for exploring relationships between targets through their associated compound data sets and for performing similarity searching.