Chemical Biology Laboratory, Center for Cancer Research, National Cancer Institute, National Institutes of Health, DHHS, NCI-Frederick, 376 Boyles St., Frederick, MD 21702, USA.
J Comput Aided Mol Des. 2010 Jun;24(6-7):521-51. doi: 10.1007/s10822-010-9346-4. Epub 2010 May 29.
We have used the Chemical Structure DataBase (CSDB) of the NCI CADD Group, an aggregated collection of over 150 small-molecule databases totaling 103.5 million structure records, to conduct tautomerism analyses on one of the largest currently existing sets of real (i.e. not computer-generated) compounds. This analysis was carried out using calculable chemical structure identifiers developed by the NCI CADD Group, based on hash codes available in the chemoinformatics toolkit CACTVS and a newly developed scoring scheme to define a canonical tautomer for any encountered structure. We used the CACTVS tautomerism definition, a set of 21 transform rules expressed in SMIRKS line notation, which takes a comprehensive stance on the types of tautomeric interconversion it includes. Tautomerism was found to be possible for more than 2/3 of the unique structures in the CSDB. A total of 680 million tautomers were calculated from, and including, the original structure records. Tautomeric overlap within the same individual database (i.e. at least one other entry was present that was merely a different tautomeric representation of the same compound) was found at an average rate of 0.3% of the original structure records, with values as high as nearly 2% for some of the databases in CSDB. Projected onto the set of unique structures (by FICuS identifier), this still occurred in about 1.5% of the cases. Tautomeric overlap across all constituent databases in CSDB was found for nearly 10% of the records in the collection.
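The within-database overlap measure described above can be illustrated with a minimal sketch. It is not the CACTVS implementation: it assumes each record already carries two precomputed string identifiers, a hypothetical `structure_key` that distinguishes the tautomer as drawn and a hypothetical `tautomer_key` shared by all tautomers of a compound. The `Record` type, the field names, the function name, and the toy data are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple


class Record(NamedTuple):
    database: str
    record_id: str
    structure_key: str  # hypothetical: sensitive to the tautomer as drawn
    tautomer_key: str   # hypothetical: identical for all tautomers of a compound


def within_database_overlap(records: Iterable[Record]) -> Dict[str, float]:
    """Fraction of records per database that have at least one other entry
    in the same database differing only in tautomeric representation."""
    records = list(records)

    # Collect the distinct tautomeric forms seen per (database, compound).
    forms = defaultdict(set)
    for r in records:
        forms[(r.database, r.tautomer_key)].add(r.structure_key)

    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in records:
        totals[r.database] += 1
        # More than one distinct form means a tautomeric duplicate exists.
        # Exact duplicates (same structure_key) are deliberately not counted.
        if len(forms[(r.database, r.tautomer_key)]) > 1:
            hits[r.database] += 1

    return {db: hits[db] / totals[db] for db in totals}


# Toy data; the keys are placeholders, not real identifiers.
records = [
    Record("db_a", "a1", "2-hydroxypyridine", "T1"),
    Record("db_a", "a2", "2-pyridone", "T1"),
    Record("db_a", "a3", "benzene", "T2"),
    Record("db_a", "a4", "benzene", "T2"),
    Record("db_b", "b1", "2-pyridone", "T1"),
]
overlap = within_database_overlap(records)
```

In this toy set, the two `db_a` entries for the hydroxypyridine/pyridone pair count as overlapping, while the two identical benzene entries do not, since they are exact duplicates rather than different tautomeric representations. Overlap across databases could be sketched the same way by grouping on `tautomer_key` alone.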