Department of Medical Informatics, Erasmus University Medical Center, P.O. Box 2040, 3000 CA Rotterdam, Netherlands.
J Cheminform. 2012 Dec 13;4(1):35. doi: 10.1186/1758-2946-4-35.
The correctness of structures and associated metadata in public and commercial chemical databases greatly affects drug discovery research activities such as quantitative structure-property relationship modelling and compound novelty checking. MOL files, SMILES notations, IUPAC names, and InChI strings are ubiquitous file formats and systematic identifiers for chemical structures. Although these representations are interchangeable for many cheminformatics purposes, no studies have examined the inconsistencies among them that arise from different approaches to data integration, including the use of different software and different structure standardisation rules. We investigated the consistency of systematic identifiers of small molecules within and between some of the commonly used chemical resources, with and without structure standardisation.
The consistency between systematic chemical identifiers and their corresponding MOL representation varies greatly between data sources (37.2% to 98.5%). The lowest overall consistency was observed between MOL representations and IUPAC names. Disregarding stereochemistry increases the consistency (84.8% to 99.9%). Consistency also varies widely between MOL representations of compounds linked via cross-references (25.8% to 93.7%); here too, removing stereochemistry improved the consistency (47.6% to 95.6%).
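As a rough illustration of how a consistency percentage "with and without stereochemistry" can be computed, the following minimal Python sketch compares stored InChI strings with regenerated ones. It is not the procedure used in the study. It assumes the regenerated InChI strings already exist (in practice a cheminformatics toolkit would produce them from the MOL representation), and it approximates "disregarding stereochemistry" by dropping the InChI stereo layers (`/b`, `/t`, `/m`, `/s`) with a regular expression. The function names and the four toy pairs are hypothetical and are not data from the study.

```python
import re

# InChI stereo layers: /b (double bond), /t (tetrahedral centres),
# /m (inversion marker), /s (stereo type). Layer prefixes are lowercase,
# so the formula layer (which starts with an element symbol) is never hit.
_STEREO_LAYER = re.compile(r"/[btms][^/]*")


def strip_stereo(inchi: str) -> str:
    """Return a comparison key with all InChI stereo layers removed."""
    return _STEREO_LAYER.sub("", inchi)


def consistency(pairs, ignore_stereo=False):
    """Percentage of (stored, regenerated) identifier pairs that match."""
    if not pairs:
        raise ValueError("no identifier pairs to compare")
    key = strip_stereo if ignore_stereo else (lambda s: s)
    matches = sum(key(stored) == key(regenerated) for stored, regenerated in pairs)
    return 100.0 * matches / len(pairs)


# Toy data (assumption, for illustration only).
L_ALA = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
D_ALA = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1"
ALA_NO_STEREO = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)"
ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
ACETIC_ACID = "InChI=1S/C2H4O2/c1-2(3)4/h1H3,(H,3,4)"

PAIRS = [
    (L_ALA, L_ALA),          # identical
    (L_ALA, D_ALA),          # differs only in stereo layers
    (L_ALA, ALA_NO_STEREO),  # stereo missing on one side
    (ETHANOL, ACETIC_ACID),  # genuinely different structures
]

print(consistency(PAIRS))                      # strict comparison
print(consistency(PAIRS, ignore_stereo=True))  # stereo layers ignored
```

With these toy pairs the strict comparison matches one pair in four and the stereo-insensitive comparison matches three in four, the same direction of effect as the ranges reported above. The stripped string is only a comparison key; it is not guaranteed to equal the InChI that a toolkit would generate for the stereo-free structure.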
We have shown that considerable inconsistency exists in structural representations and systematic chemical identifiers within and between databases. This can have a substantial effect, particularly when data are merged and systematic identifiers are used as a key index for structure integration or for cross-querying several databases. Regenerating systematic identifiers from their MOL representation, after applying well-defined and documented chemistry standardisation rules to all compounds, can dramatically increase internal consistency.