Consistency of systematic chemical identifiers within and between small-molecule databases.

Affiliation

Department of Medical Informatics, Erasmus University Medical Center, P.O. Box 2040, 3000 CA Rotterdam, The Netherlands.

Publication information

J Cheminform. 2012 Dec 13;4(1):35. doi: 10.1186/1758-2946-4-35.

Abstract

BACKGROUND

The correctness of structures and associated metadata in public and commercial chemical databases greatly affects drug discovery research activities such as quantitative structure-property relationship modelling and compound novelty checking. MOL files, SMILES notations, IUPAC names, and InChI strings are ubiquitous file formats and systematic identifiers for chemical structures. Although these identifiers are interchangeable for many cheminformatics purposes, no studies have examined the inconsistencies among them that arise from different approaches to data integration, including the use of different software and different rules for structure standardisation. We investigated the consistency of systematic identifiers of small molecules within and between several commonly used chemical resources, with and without structure standardisation.

RESULTS

The consistency between systematic chemical identifiers and their corresponding MOL representation varies greatly between data sources (37.2% to 98.5%). The lowest overall consistency was observed between MOL representations and IUPAC names. Disregarding stereochemistry increases the consistency (84.8% to 99.9%). Consistency also varies widely between MOL representations of compounds linked via cross-references (25.8% to 93.7%); removing stereochemistry improves it (47.6% to 95.6%).
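
The abstract does not say how the consistency percentages were computed. As a rough illustration only, the sketch below assumes consistency is the share of records whose stored identifier equals the identifier regenerated from the MOL representation, and it uses InChI strings as the identifier. The function names, the toy records, and the string-level removal of the InChI stereo layers (/b, /t, /m, /s) are assumptions made for this example, not the authors' method; a real pipeline would regenerate identifiers with a cheminformatics toolkit.

```python
# Hypothetical helpers; names and data are illustrative, not from the paper.
STEREO_PREFIXES = {"b", "t", "m", "s"}  # InChI stereo layers: /b /t /m /s


def strip_stereo(inchi: str) -> str:
    """Drop the stereo layers from an InChI string (string-level simplification)."""
    head, *layers = inchi.split("/")
    # layers[0] is the formula layer and is always kept.
    kept = [layer for i, layer in enumerate(layers)
            if i == 0 or layer[:1] not in STEREO_PREFIXES]
    return "/".join([head, *kept])


def consistency(pairs, ignore_stereo=False):
    """Percentage of (stored, regenerated) identifier pairs that match exactly."""
    if not pairs:
        raise ValueError("no identifier pairs to compare")
    normalise = strip_stereo if ignore_stereo else (lambda s: s)
    matches = sum(normalise(a) == normalise(b) for a, b in pairs)
    return 100.0 * matches / len(pairs)


L_ALA = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
ALA_FLAT = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)"
ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
METHANOL = "InChI=1S/CH4O/c1-2/h2H,1H3"

# (identifier stored in the database, identifier regenerated from the MOL file)
PAIRS = [
    (L_ALA, L_ALA),       # identical
    (L_ALA, ALA_FLAT),    # differs only in stereochemistry
    (ETHANOL, ETHANOL),   # identical
    (ETHANOL, METHANOL),  # different compound
]

if __name__ == "__main__":
    print(f"with stereo:    {consistency(PAIRS):.1f}%")
    print(f"without stereo: {consistency(PAIRS, ignore_stereo=True):.1f}%")
```

On these four toy records, the second pair counts as a mismatch only while stereochemistry is considered, which mirrors the direction of the effect reported above.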

CONCLUSIONS

Considerable inconsistency exists in structural representations and systematic chemical identifiers within and between databases. This matters most when data are merged and when systematic identifiers serve as the key index for structure integration or for cross-querying several databases. Regenerating systematic identifiers from their MOL representation, after applying well-defined and documented chemistry standardisation rules to all compounds, can dramatically increase internal consistency.
