Get Your Atoms in Order--An Open-Source Implementation of a Novel and Robust Molecular Canonicalization Algorithm.

Affiliations

Novartis Institutes for BioMedical Research, Novartis Pharma AG, Novartis Campus, CH-4002 Basel, Switzerland.

NextMove Software Ltd., Innovation Centre, Unit 23, Science Park, Milton Road, Cambridge CB4 0EY, U.K.

Publication information

J Chem Inf Model. 2015 Oct 26;55(10):2111-20. doi: 10.1021/acs.jcim.5b00543. Epub 2015 Oct 15.

DOI:10.1021/acs.jcim.5b00543

PMID:26441310

Abstract

Finding a canonical ordering of the atoms in a molecule is a prerequisite for generating a unique representation of the molecule. The canonicalization of a molecule is usually accomplished by applying some sort of graph relaxation algorithm, the most common of which is the Morgan algorithm. There are known issues with that algorithm that lead to noncanonical atom orderings as well as problems when it is applied to large molecules like proteins. Furthermore, each cheminformatics toolkit or software provides its own version of a canonical ordering, most based on unpublished algorithms, which also complicates the generation of a universal unique identifier for molecules. We present an alternative canonicalization approach that uses a standard stable-sorting algorithm instead of a Morgan-like index. Two new invariants that allow canonical ordering of molecules with dependent chirality as well as those with highly symmetrical cyclic graphs have been developed. The new approach proved to be robust and fast when tested on the 1.45 million compounds of the ChEMBL 20 data set in different scenarios like random renumbering of input atoms or SMILES round tripping. Our new algorithm is able to generate a canonical order of the atoms of protein molecules within a few milliseconds. The novel algorithm is implemented in the open-source cheminformatics toolkit RDKit. With this paper, we provide a reference Python implementation of the algorithm that could easily be integrated in any cheminformatics toolkit. This provides a first step toward a common standard for canonical atom ordering to generate a universal unique identifier for molecules other than InChI.
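
The general idea of ranking atoms by sorting on invariants, rather than by a Morgan-like index, can be illustrated with a small pure-Python sketch. This is not the paper's reference implementation and not the RDKit code: it is a toy under stated assumptions. Molecules are reduced to heavy-atom symbols plus an undirected bond list; bond orders, charges, chirality and the two new invariants mentioned in the abstract are ignored; and ties are broken by picking the first tied atom, which is only safe when the tied atoms are symmetry-equivalent, as they are in the example molecule. All function names (`canonical_ranks`, `canonical_form`, `renumber`) are hypothetical.

```python
import random

def canonical_ranks(symbols, bonds):
    """Toy ranking: one distinct rank per atom, built by sorting on invariants."""
    n = len(symbols)
    nbrs = [[] for _ in range(n)]
    for a, b in bonds:
        nbrs[a].append(b)
        nbrs[b].append(a)

    def dense_rank(keys):
        # Equal keys share a rank; ranks depend only on key values.
        lookup = {k: r for r, k in enumerate(sorted(set(keys)))}
        return [lookup[k] for k in keys]

    def refine(ranks):
        # Split rank classes using the sorted ranks of each atom's neighbours
        # until the partition stops changing.
        while True:
            keys = [(ranks[i], tuple(sorted(ranks[j] for j in nbrs[i])))
                    for i in range(n)]
            new = dense_rank(keys)
            if new == ranks:
                return ranks
            ranks = new

    # Initial invariant (an assumption of this sketch): element symbol and degree.
    ranks = refine(dense_rank([(symbols[i], len(nbrs[i])) for i in range(n)]))
    while len(set(ranks)) < n:
        # Break one tie in the lowest tied class, then refine again.
        tied = min(r for r in ranks if ranks.count(r) > 1)
        chosen = ranks.index(tied)
        ranks = [2 * r for r in ranks]
        ranks[chosen] -= 1
        ranks = refine(dense_rank(ranks))
    return ranks

def canonical_form(symbols, bonds):
    """Relabel atoms by rank so that equivalent inputs give identical output."""
    ranks = canonical_ranks(symbols, bonds)
    atoms = tuple(s for _, s in sorted(zip(ranks, symbols)))
    edges = tuple(sorted(tuple(sorted((ranks[a], ranks[b]))) for a, b in bonds))
    return atoms, edges

def renumber(symbols, bonds, rng):
    """Randomly permute atom indices and bond order (perm[old] = new)."""
    perm = list(range(len(symbols)))
    rng.shuffle(perm)
    new_symbols = [None] * len(symbols)
    for old, new in enumerate(perm):
        new_symbols[new] = symbols[old]
    new_bonds = [(perm[a], perm[b]) for a, b in bonds]
    rng.shuffle(new_bonds)
    return new_symbols, new_bonds

# Heavy-atom skeleton of 4-methylphenol: methyl C, six ring carbons, one O.
SYMBOLS = ["C", "C", "C", "C", "C", "C", "C", "O"]
BONDS = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 1), (4, 7)]

if __name__ == "__main__":
    rng = random.Random(0)
    reference = canonical_form(SYMBOLS, BONDS)
    same = all(canonical_form(*renumber(SYMBOLS, BONDS, rng)) == reference
               for _ in range(100))
    print("stable under random renumbering:", same)
```

The check at the end is a small-scale analogue of the random-renumbering test described in the abstract: the same molecule is presented with shuffled atom indices, and the relabelled graph is expected to come out identical each time. A general-purpose canonicalizer needs additional invariants and more careful tie handling than this sketch provides.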
