Consistency of systematic chemical identifiers within and between small-molecule databases.

Affiliation

Department of Medical Informatics, Erasmus University Medical Center, P.O. Box 2040, 3000 CA Rotterdam, Netherlands.

Publication information

J Cheminform. 2012 Dec 13;4(1):35. doi: 10.1186/1758-2946-4-35.

DOI: 10.1186/1758-2946-4-35

PMID: 23237381

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3539895/

Abstract

BACKGROUND

The correctness of structures and associated metadata in public and commercial chemical databases greatly affects drug discovery research activities such as quantitative structure-property relationship modelling and compound novelty checking. MOL files, SMILES notations, IUPAC names, and InChI strings are ubiquitous file formats and systematic identifiers for chemical structures. Although these identifiers are interchangeable for many cheminformatics purposes, no studies have examined the inconsistencies among them that arise from different approaches to data integration, including the use of different software and different rules for structure standardisation. We investigated the consistency of systematic identifiers of small molecules within and between some commonly used chemical resources, with and without structure standardisation.

RESULTS

The consistency between systematic chemical identifiers and their corresponding MOL representation varies greatly between data sources (37.2% to 98.5%). The lowest overall consistency was observed between MOL representations and IUPAC names. Disregarding stereochemistry increases the consistency (84.8% to 99.9%). Consistency also varies widely between MOL representations of compounds linked via cross-references (25.8% to 93.7%); here too, removing stereochemistry improves it (47.6% to 95.6%).
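As an illustration of how such consistency percentages can be computed, the following minimal Python sketch compares a stored InChI with an InChI assumed to have been regenerated from the record's MOL file, once as-is and once with the stereochemical layers removed. It is not the authors' pipeline: the record field names (`stored_inchi`, `mol_inchi`), the toy records, and the string-level removal of the `/b`, `/t`, `/m` and `/s` layers are assumptions made for this example, whereas the study itself relied on cheminformatics software and structure standardisation.

```python
import re

# Stereo-related InChI layers: /b (double bond), /t (tetrahedral),
# /m and /s (stereo flags). Removing them at string level is a simplification.
_STEREO_LAYER = re.compile(r"/[btms][^/]*")


def strip_stereo(inchi: str) -> str:
    """Return the InChI string with its stereochemical layers removed."""
    return _STEREO_LAYER.sub("", inchi)


def consistency(records, ignore_stereo=False):
    """Percentage of records whose stored InChI equals the MOL-derived InChI."""
    if not records:
        raise ValueError("no records to compare")
    norm = strip_stereo if ignore_stereo else (lambda s: s)
    matches = sum(norm(r["stored_inchi"]) == norm(r["mol_inchi"]) for r in records)
    return 100.0 * matches / len(records)


# Hypothetical toy records; "mol_inchi" stands in for an InChI regenerated
# from the MOL file by a toolkit of your choice.
ALA = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)"
ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
METHANOL = "InChI=1S/CH4O/c1-2/h2H,1H3"

records = [
    {"stored_inchi": ETHANOL, "mol_inchi": ETHANOL},                          # identical
    {"stored_inchi": ALA + "/t2-/m0/s1", "mol_inchi": ALA},                   # stereo missing in MOL
    {"stored_inchi": ALA + "/t2-/m1/s1", "mol_inchi": ALA + "/t2-/m0/s1"},    # opposite stereo
    {"stored_inchi": ETHANOL, "mol_inchi": METHANOL},                         # different compound
]

with_stereo = consistency(records)
without_stereo = consistency(records, ignore_stereo=True)
print(f"with stereo: {with_stereo:.1f}%, without stereo: {without_stereo:.1f}%")
```

On these toy records only the first pair matches exactly, while the two alanine pairs also match once stereo layers are ignored, which mirrors the direction of the effect reported above, not its magnitude.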

CONCLUSIONS

We have shown that considerable inconsistency exists in structural representation and systematic chemical identifiers within and between databases. This can have a large effect, especially when data are merged and systematic identifiers serve as the key index for structure integration or for cross-querying several databases. Regenerating systematic identifiers from their MOL representation, after applying well-defined and documented chemistry standardisation rules to all compounds, can dramatically increase internal consistency.
