Mining a chemical database for fragment co-occurrence: discovery of "chemical clichés".

Authors

Lameijer Eric-Wubbo, Kok Joost N, Bäck Thomas, Ijzerman Ad P

Affiliation

Division of Medicinal Chemistry, Leiden/Amsterdam Center for Drug Research, Leiden University, Einsteinweg 55, 2300 RA Leiden, The Netherlands.

Publication

J Chem Inf Model. 2006 Mar-Apr;46(2):553-62. doi: 10.1021/ci050370c.

DOI:10.1021/ci050370c

PMID:16562983

Abstract

Nowadays millions of different compounds are known, their structures stored in electronic databases. Analysis of these data could yield valuable insights into the laws of chemistry and the habits of chemists. We have therefore explored the public database of the National Cancer Institute (>250,000 compounds) by pattern searching. We split the molecules of this database into fragments to find out which fragments exist, how frequent they are, and whether the occurrence of one fragment in a molecule is related to the occurrence of another, nonoverlapping fragment. It turns out that some fragments and combinations of fragments are so frequent that they can be called "chemical clichés". We believe that the fragment data can give insight into the chemical space explored so far by synthesis. The lists of fragments and their (co-)occurrences can help create novel chemical compounds by (i) systematically listing the most popular and therefore most easily used substituents and ring systems for synthesizing new compounds, (ii) being an easily accessible repository for rarer fragments suitable for lead compound optimization, and (iii) pointing out some of the yet unexplored parts of chemical space.
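
The co-occurrence question raised in the abstract can be illustrated with a minimal sketch. It assumes each molecule has already been split into a set of nonoverlapping fragment labels; the fragmentation step is not shown, and the toy data, the function name `fragment_cooccurrence`, and the observed/expected ratio are illustrative assumptions, not the statistic or implementation used by the authors.

```python
from collections import Counter
from itertools import combinations


def fragment_cooccurrence(molecules):
    """Count fragments and fragment pairs over a list of molecules.

    Each molecule is an iterable of fragment labels (hypothetical input
    format). Returns per-fragment counts and, for every pair seen together,
    a tuple (observed, expected, ratio), where `expected` is the pair count
    predicted if the two fragments occurred independently.
    """
    n = len(molecules)
    single = Counter()
    pair = Counter()
    for frags in molecules:
        unique = sorted(set(frags))  # count each fragment once per molecule
        single.update(unique)
        pair.update(combinations(unique, 2))

    results = {}
    for (a, b), observed in pair.items():
        expected = single[a] * single[b] / n
        results[(a, b)] = (observed, expected, observed / expected)
    return single, results


# Toy data set, invented for illustration only.
toy = [
    {"benzene", "methyl"},
    {"benzene", "methyl", "hydroxyl"},
    {"benzene", "hydroxyl"},
    {"pyridine", "methyl"},
]

single, pairs = fragment_cooccurrence(toy)
for (a, b), (obs, exp, ratio) in sorted(pairs.items()):
    print(f"{a} + {b}: observed={obs}, expected={exp:.2f}, ratio={ratio:.2f}")
```

In this sketch a ratio well above 1 marks a pair that appears together more often than independence would predict, which is one possible way to flag a candidate "cliché" combination; a ratio near 0 for two individually common fragments would hint at a sparsely explored combination.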
