Get Your Atoms in Order--An Open-Source Implementation of a Novel and Robust Molecular Canonicalization Algorithm.

Author affiliations

Novartis Institutes for BioMedical Research, Novartis Pharma AG, Novartis Campus, CH-4002 Basel, Switzerland.

NextMove Software Ltd., Innovation Centre, Unit 23, Science Park, Milton Road, Cambridge CB4 0EY, U.K.

Publication information

J Chem Inf Model. 2015 Oct 26;55(10):2111-20. doi: 10.1021/acs.jcim.5b00543. Epub 2015 Oct 15.

Abstract

Finding a canonical ordering of the atoms in a molecule is a prerequisite for generating a unique representation of the molecule. The canonicalization of a molecule is usually accomplished by applying some sort of graph relaxation algorithm, the most common of which is the Morgan algorithm. There are known issues with that algorithm that lead to noncanonical atom orderings as well as problems when it is applied to large molecules like proteins. Furthermore, each cheminformatics toolkit or software provides its own version of a canonical ordering, most based on unpublished algorithms, which also complicates the generation of a universal unique identifier for molecules. We present an alternative canonicalization approach that uses a standard stable-sorting algorithm instead of a Morgan-like index. Two new invariants that allow canonical ordering of molecules with dependent chirality as well as those with highly symmetrical cyclic graphs have been developed. The new approach proved to be robust and fast when tested on the 1.45 million compounds of the ChEMBL 20 data set in different scenarios like random renumbering of input atoms or SMILES round tripping. Our new algorithm is able to generate a canonical order of the atoms of protein molecules within a few milliseconds. The novel algorithm is implemented in the open-source cheminformatics toolkit RDKit. With this paper, we provide a reference Python implementation of the algorithm that could easily be integrated in any cheminformatics toolkit. This provides a first step toward a common standard for canonical atom ordering to generate a universal unique identifier for molecules other than InChI.
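
The idea of ranking atoms by sorting invariants, rather than accumulating a Morgan-like index, can be illustrated with a minimal Python sketch. This is not the paper's reference implementation and omits its two new invariants, chirality handling, and bond orders. The function names (`canonical_ranks`, `canonical_form`), the choice of element symbol plus degree as the initial invariant, and the simple tie-breaking rule are all assumptions made for illustration. Breaking a tie by picking an arbitrary member of a class is only safe when the tied atoms are truly symmetry-equivalent, which holds for the small example shown but not for every graph.

```python
from typing import List, Sequence, Tuple


def _dense_ranks(keys: Sequence) -> List[int]:
    """Map each key to the position of its equivalence class in sorted order."""
    order = {key: rank for rank, key in enumerate(sorted(set(keys)))}
    return [order[key] for key in keys]


def _refine(ranks: List[int], neighbors: Sequence[Sequence[int]]) -> List[int]:
    """Split rank classes using the sorted neighbour ranks until nothing changes."""
    while True:
        keys = [
            (ranks[a], tuple(sorted(ranks[n] for n in neighbors[a])))
            for a in range(len(ranks))
        ]
        new_ranks = _dense_ranks(keys)
        # The old rank leads each key, so classes can only split, never merge.
        if len(set(new_ranks)) == len(set(ranks)):
            return new_ranks
        ranks = new_ranks


def canonical_ranks(symbols: Sequence[str],
                    bonds: Sequence[Tuple[int, int]]) -> List[int]:
    """Return one rank per atom (hypothetical helper, simplified invariants)."""
    n = len(symbols)
    neighbors: List[List[int]] = [[] for _ in range(n)]
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)

    # Assumed initial invariant: element symbol and number of neighbours.
    ranks = _dense_ranks([(symbols[a], len(neighbors[a])) for a in range(n)])
    ranks = _refine(ranks, neighbors)

    # Break remaining ties one at a time, then refine again.
    while len(set(ranks)) < n:
        counts = {}
        for r in ranks:
            counts[r] = counts.get(r, 0) + 1
        tied = min(r for r, c in counts.items() if c > 1)
        chosen = ranks.index(tied)  # arbitrary member of the tied class
        ranks = [2 * r for r in ranks]
        ranks[chosen] -= 1  # chosen atom now sorts just before its class mates
        ranks = _refine(_dense_ranks(ranks), neighbors)
    return ranks


def canonical_form(symbols: Sequence[str], bonds: Sequence[Tuple[int, int]]):
    """Relabel atoms by rank to get a numbering-independent description."""
    ranks = canonical_ranks(symbols, bonds)
    ordered_symbols = tuple(s for _, s in sorted(zip(ranks, symbols)))
    ordered_bonds = tuple(
        sorted(tuple(sorted((ranks[a], ranks[b]))) for a, b in bonds)
    )
    return ordered_symbols, ordered_bonds


# Heavy-atom graph of propan-2-ol under two different input numberings.
propan_2_ol = (["C", "C", "C", "O"], [(0, 1), (1, 2), (1, 3)])
renumbered = (["O", "C", "C", "C"], [(0, 2), (1, 2), (3, 2)])

print(canonical_form(*propan_2_ol) == canonical_form(*renumbered))  # prints True
```

Comparing the canonical form of the same graph under different input numberings mirrors, on a very small scale, the random-renumbering test the abstract describes for ChEMBL 20.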

