Molecular diversity in chemical databases: comparison of medicinal chemistry knowledge bases and databases of commercially available compounds.

Author information

Cummins D J, Andrews C W, Bentley J A, Cory M

Affiliation

Division of Medicinal Chemistry, Glaxo Wellcome, Research Triangle Park, North Carolina 27709, USA.

Publication information

J Chem Inf Comput Sci. 1996 Jul-Aug;36(4):750-63. doi: 10.1021/ci950168h.

DOI:10.1021/ci950168h

PMID:8768767

Abstract

A molecular descriptor space that describes structural diversity has been developed, and large databases of molecules have been mapped into it and compared. The analysis used five chemical databases: CMC and MDDR, knowledge bases containing active medicinal agents; ACD and SPECS, two databases of commercially available compounds; and the Wellcome Registry. Together these databases contained more than 300,000 structures. Topological indices and the free energy of solvation were computed for each compound in the databases. Factor analysis was used to reduce the dimensionality of the descriptor space. Low-density observations were deleted as a way of removing outliers, which allowed a further reduction in the descriptor space of interest. The five databases could then be compared efficiently using a metric developed for this purpose. A Riemann gridding scheme was used to subdivide the factor space into subhypercubes to obtain accurate comparisons. Most of the 300,000 structures were highly clustered, but unique structures were found. The overlap between the biological and commercial databases was also analyzed. The metric provides a useful algorithm for choosing screening sets of diverse compounds from large databases.
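The gridding idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the definition of the metric, the density threshold, or the grid resolution, so everything below is an assumption made for illustration. The function names (`cell_ids`, `drop_low_density`, `overlap_fraction`, `pick_diverse`), the parameters `bins` and `min_count`, and the use of shared occupied cells as an overlap measure are all hypothetical. The input arrays are assumed to hold factor scores that have already been computed, one row per compound.

```python
from collections import Counter

import numpy as np


def cell_ids(points, lo, hi, bins):
    """Map each row of `points` to the index tuple of its grid cell.

    `lo` and `hi` are per-axis bounds of the factor space and `bins` is the
    number of equal-width intervals per axis (all assumed, not from the paper).
    """
    scaled = (points - lo) / (hi - lo)            # rescale each axis to [0, 1]
    idx = np.floor(scaled * bins).astype(int)
    idx = np.clip(idx, 0, bins - 1)               # upper-edge points go in the last bin
    return [tuple(row) for row in idx.tolist()]


def drop_low_density(points, lo, hi, bins, min_count):
    """Keep only points whose cell holds at least `min_count` points.

    This is one simple way to discard low-density observations; the
    criterion used in the paper is not stated in the abstract.
    """
    ids = cell_ids(points, lo, hi, bins)
    counts = Counter(ids)
    keep = np.array([counts[c] >= min_count for c in ids], dtype=bool)
    return points[keep]


def overlap_fraction(a, b, lo, hi, bins):
    """Fraction of the cells occupied by `a` that are also occupied by `b`."""
    cells_a = set(cell_ids(a, lo, hi, bins))
    cells_b = set(cell_ids(b, lo, hi, bins))
    return len(cells_a & cells_b) / len(cells_a)


def pick_diverse(points, lo, hi, bins):
    """Return row indices of one representative (the first seen) per occupied cell."""
    first_seen = {}
    for i, cell in enumerate(cell_ids(points, lo, hi, bins)):
        first_seen.setdefault(cell, i)
    return sorted(first_seen.values())
```

Under these assumptions, two databases are compared by calling `overlap_fraction` on their factor scores with shared bounds, and a screening set is drawn with `pick_diverse`, which keeps a single compound from each occupied cell so that densely populated regions do not dominate the selection.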
