SMILES all around: structure to SMILES conversion for transition metal complexes.

Authors

Rasmussen Maria H, Strandgaard Magnus, Seumer Julius, Hemmingsen Laura K, Frei Angelo, Balcells David, Jensen Jan H

Affiliations

Department of Chemistry, University of Copenhagen, Copenhagen, Denmark.

Department of Chemistry, University of York, York, UK.

Publication information

J Cheminform. 2025 Apr 28;17(1):63. doi: 10.1186/s13321-025-01008-1.

DOI:10.1186/s13321-025-01008-1

PMID:40296090

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12039060/

Abstract

We present a method for creating RDKit-parsable SMILES for transition metal complexes (TMCs) from the xyz coordinates and overall charge of the complex. It can be viewed as an extension of the program xyz2mol, which does the same for organic molecules. The only dependency is RDKit, which makes the method widely applicable. Generating SMILES from structure for TMCs has so far lacked an existing SMILES dataset to compare against, so sanity-checking a method has required manual work. We therefore also generate SMILES in two other ways. In the first, ligand charges and TMC connectivity are based on natural bond orbital (NBO) analysis from density functional theory (DFT) calculations, using recent work by Kneiding et al. (Digit Discov 2: 618-633, 2023). In the second, we fix SMILES available through the Cambridge Structural Database (CSD) so that RDKit can parse them. We compare these three ways of obtaining SMILES for a subset of the CSD (tmQMg) and find >70% agreement for all three pairs. We use the SMILES to build simple molecular fingerprint (FP) and graph-based representations of the molecules for machine learning. Compared with the graphs of Kneiding et al., in which nodes and edges are featurized with DFT properties, the SMILES-based representations can perform equally well, depending on the target property (polarizability, HOMO-LUMO gap or dipole moment). This makes them well suited as baseline models. Finally, we present a dataset of 227k RDKit-parsable SMILES for mononuclear TMCs in the CSD.

Scientific contribution

We present a method that creates RDKit-parsable SMILES strings of transition metal complexes (TMCs) from Cartesian coordinates and use it to create a dataset of 227k TMC SMILES strings. The RDKit-parsability allows us to perform machine learning studies of TMC properties using "standard" molecular representations such as fingerprints and 2D-graph convolution. We show that these relatively simple representations can perform quite well, depending on the target property.
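The three-way comparison described in the abstract can be pictured as a pairwise agreement calculation. The following is a minimal sketch, not the authors' implementation: it assumes each method yields a mapping from a complex identifier to an already canonicalized SMILES string, so that string equality stands in for agreement. The function name `pairwise_agreement`, the method labels, the identifiers, and the placeholder strings are all hypothetical.

```python
from itertools import combinations


def pairwise_agreement(smiles_by_method):
    """Return {(method_a, method_b): fraction of shared IDs with identical strings}.

    Assumes the strings were canonicalized beforehand, so that plain
    equality is a meaningful comparison (an assumption of this sketch).
    """
    results = {}
    for a, b in combinations(sorted(smiles_by_method), 2):
        # Only compare complexes for which both methods produced a string.
        shared = smiles_by_method[a].keys() & smiles_by_method[b].keys()
        if not shared:
            results[(a, b)] = None
            continue
        same = sum(smiles_by_method[a][k] == smiles_by_method[b][k] for k in shared)
        results[(a, b)] = same / len(shared)
    return results


# Toy data: hypothetical identifiers and placeholder strings, not real entries.
toy = {
    "xyz": {"id1": "A", "id2": "B", "id3": "C", "id4": "D"},
    "nbo": {"id1": "A", "id2": "B", "id3": "X", "id4": "D"},
    "csd": {"id1": "A", "id2": "Y", "id3": "C", "id4": "D"},
}

agreement = pairwise_agreement(toy)
for pair, fraction in agreement.items():
    print(pair, fraction)
```

Restricting each comparison to the identifiers shared by both methods keeps a failure of one method on a given complex from counting as a disagreement; whether the paper treats failures this way is not stated in the abstract.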


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0bee/12039060/cccaa29029e8/13321_2025_1008_Fig1_HTML.jpg
