Towards a Universal SMILES representation - A standard method to generate canonical SMILES based on the InChI.

Affiliation

Analytical and Biological Chemistry Research Facility, Cavanagh Pharmacy Building, University College Cork, Cork, Co. Cork, Ireland.

Publication information

J Cheminform. 2012 Sep 18;4(1):22. doi: 10.1186/1758-2946-4-22.

Abstract

BACKGROUND

There are two line notations of chemical structures that have established themselves in the field: the SMILES string and the InChI string. The InChI aims to provide a unique, or canonical, identifier for chemical structures, while SMILES strings are widely used for storage and interchange of chemical structures, but no standard exists to generate a canonical SMILES string.

RESULTS

I describe how to use the InChI canonicalisation to derive a canonical SMILES string in a straightforward way, either incorporating the InChI normalisations (Inchified SMILES) or not (Universal SMILES). This is the first description of a method to generate canonical SMILES that takes stereochemistry into account. When tested on the 1.1 million compounds in the ChEMBL database and a 1 million compound subset of the PubChem Substance database, no canonicalisation failures were found with Inchified SMILES. Using Universal SMILES, 99.79% of the ChEMBL database and 99.77% of the PubChem subset were canonicalised successfully.
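
The general idea of deriving a SMILES string from an externally supplied canonical labelling can be illustrated with a minimal Python sketch. It is not the method described in the paper, whose traversal rules are not given in this abstract. The function name `smiles_from_labels`, its input format, and the rule "start at the lowest label and visit neighbours in order of increasing label" are assumptions made for illustration. The labels are passed in by hand as a stand-in for what an InChI canonicalisation would provide, and the sketch handles only acyclic molecules with single bonds and no stereochemistry.

```python
from collections import defaultdict


def smiles_from_labels(atoms, bonds, labels):
    """Write a SMILES string for an acyclic, single-bonded molecule.

    atoms  : dict mapping atom id -> element symbol (e.g. "C", "O")
    bonds  : iterable of (atom_id, atom_id) pairs
    labels : dict mapping atom id -> canonical label; a hypothetical
             stand-in for labels taken from an InChI canonicalisation
    """
    neighbours = defaultdict(list)
    for a, b in bonds:
        neighbours[a].append(b)
        neighbours[b].append(a)

    def write(atom, parent):
        # Assumed rule: visit neighbours in order of increasing label.
        children = sorted(
            (n for n in neighbours[atom] if n != parent), key=labels.get
        )
        out = atoms[atom]
        # Every child except the last is written as a parenthesised branch.
        for child in children[:-1]:
            out += "(" + write(child, atom) + ")"
        if children:
            out += write(children[-1], atom)
        return out

    # Assumed rule: start from the atom with the lowest canonical label.
    start = min(atoms, key=labels.get)
    return write(start, None)


# Propan-2-ol entered in two different atom orders, same canonical labels.
first = smiles_from_labels(
    {0: "C", 1: "C", 2: "C", 3: "O"},
    [(0, 1), (1, 2), (1, 3)],
    {0: 1, 2: 2, 1: 3, 3: 4},
)
second = smiles_from_labels(
    {"a": "O", "b": "C", "c": "C", "d": "C"},
    [("b", "a"), ("d", "b"), ("b", "c")],
    {"c": 1, "d": 2, "b": 3, "a": 4},
)
print(first, second)  # CC(C)O CC(C)O
```

Because the output depends only on the labels and not on the order in which atoms and bonds were entered, any toolkit that obtains the same labels and applies the same traversal rules would write the same string.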

CONCLUSIONS

The InChI canonicalisation algorithm can successfully be used as the basis for a common standard for canonical SMILES. While challenges remain, such as the development of a standard aromatic model for SMILES, the ability to create the same SMILES using different toolkits will mean that, for the first time, it will be possible to easily compare the chemical models used by different toolkits.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/140d/3495655/756bd7c899ad/1758-2946-4-22-1.jpg
