Fernández-de Gortari Eli, García-Jacas César R, Martinez-Mayorga Karina, Medina-Franco José L
Departamento de Farmacia, Facultad de Química, Universidad Nacional Autónoma de México, Avenida Universidad 3000, 04510 Mexico City, Mexico.
Instituto de Química, Universidad Nacional Autónoma de México, Avenida Universidad 3000, 04510 Mexico City, Mexico.
J Cheminform. 2017 Feb 6;9:9. doi: 10.1186/s13321-017-0195-1. eCollection 2017.
Molecular fingerprints are widely used in several areas of chemoinformatics, including diversity analysis and similarity searching. Fingerprint-based analysis of chemical libraries, particularly large collections, usually requires a molecular representation of each compound in the library, which may lead to issues of storage space and redundant calculations. In fact, information redundancy is inherent to the data, resulting in bit positions in the fingerprint that carry no significant information.
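The redundancy described above can be illustrated with a per-bit Shannon entropy calculation. This is a minimal sketch, not the authors' procedure: the function names and the four-bit toy library are hypothetical, and fingerprints are assumed to be equal-length lists of 0/1 values.

```python
import math

def bit_entropy(p):
    """Shannon entropy (in bits) of a binary variable that is 1 with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # a bit that is always on or always off carries no information
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def per_bit_entropy(fingerprints):
    """Entropy of each bit position across a library of equal-length binary fingerprints."""
    n = len(fingerprints)
    length = len(fingerprints[0])
    # Fraction of compounds in which each bit position is set.
    freqs = [sum(fp[i] for fp in fingerprints) / n for i in range(length)]
    return [bit_entropy(p) for p in freqs]

# Hypothetical toy library: 4 compounds, 4-bit fingerprints.
library = [
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [1, 0, 1, 1],
    [1, 0, 0, 0],
]
entropies = per_bit_entropy(library)
```

In this toy library the first two positions are constant across all compounds and therefore have zero entropy, while the last two are set in half of the compounds and reach the maximum of one bit each.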
Herein, a general approach is proposed to represent an entire compound library with a single binary fingerprint. The development of the database fingerprint (DFP) is illustrated first using a short fingerprint (MACCS keys) for 10 data sets of general interest in chemistry. The application of the DFP is further shown with PubChem fingerprints for the data sets used in the primary example but with a larger number of compounds, up to 25,000 molecules. The performance of the DFP was studied through differential Shannon entropy, k-means clustering, and DFP/Tanimoto similarity.
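The abstract does not state how the single library-level fingerprint is derived. One plausible sketch, assuming that a bit is set in the DFP when the fraction of compounds having that bit reaches a cutoff, is shown below together with a Tanimoto comparison of two library-level fingerprints. The `threshold` parameter, the function names, and the toy libraries are all hypothetical and are not taken from the paper.

```python
def database_fingerprint(fingerprints, threshold=0.5):
    """Collapse a library of binary fingerprints into one binary fingerprint.

    Assumption (not from the abstract): a bit is set when the fraction of
    compounds with that bit on is at least `threshold`.
    """
    n = len(fingerprints)
    length = len(fingerprints[0])
    return [
        1 if sum(fp[i] for fp in fingerprints) / n >= threshold else 0
        for i in range(length)
    ]

def tanimoto(a, b):
    """Tanimoto coefficient of two equal-length binary fingerprints."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    # Convention chosen here: two all-zero fingerprints have similarity 0.0.
    return both / either if either else 0.0

# Hypothetical toy libraries with 4-bit fingerprints.
library_a = [[1, 1, 0, 0], [1, 0, 0, 1], [1, 1, 0, 0]]
library_b = [[1, 0, 1, 0], [1, 0, 1, 1], [0, 0, 1, 0]]

dfp_a = database_fingerprint(library_a)
dfp_b = database_fingerprint(library_b)
similarity = tanimoto(dfp_a, dfp_b)
```

Under this assumed rule, comparing two libraries takes a single Tanimoto calculation on their DFPs instead of a pairwise comparison of all their compounds. The choice of cutoff controls how many bits survive into the DFP.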
The DFP is designed to capture key information of the compound collection and can be used to compare and assess the diversity of molecular libraries. This Preliminary Communication shows the potential of the novel fingerprint to explore inter-library relationships. A major future goal is to apply the DFP to virtual screening and to develop DFPs for other data sets based on several different types of fingerprints.
Graphical Abstract: The database fingerprint captures the key information of molecular databases to perform chemical space characterization and virtual screening.