IUPHAR/BPS Guide to PHARMACOLOGY, Deanery of Biomedical Sciences, University of Edinburgh, Edinburgh, EH8 9XD, UK.
ChemMedChem. 2018 Mar 20;13(6):470-481. doi: 10.1002/cmdc.201700724. Epub 2018 Feb 23.
The three databases PubChem, ChemSpider, and UniChem capture the majority of open chemical structure records, with February 2018 totals of 95, 63, and 154 million, respectively. Collectively, they constitute a massively enabling resource for cheminformatics, chemical biology, and drug discovery. As meta-portals, they subsume and link out to the majority of public bioactivity data extracted from the literature and from screening center assay results. They therefore present three different entry points, and the many independent resources they subsume present a fourth in the form of standalone databases. Because this creates a complex picture, it is important for users to have at least some appreciation of the differences in content so that they can judge utility for the task at hand. This turns out to be challenging. By comparing the three resources in detail, this review assesses their differences, some of which are not obvious. One is that coverage differs significantly: the three portals draw on 587, 282, and 38 contributing sources, respectively. This raises not only the "who-has-what" question but also the question of "why", since the reason any particular inclusion is considered valuable is rarely made explicit. Also confusing is that sources nominally in common (i.e., having the same submitter name) can have significantly different structure counts, not only across the three portals but also relative to their standalone instantiations. Assessing a series of examples indicates that differences in loading dates and structural standardization are the main causes of this inter-portal discordance.