SMILES all around: structure to SMILES conversion for transition metal complexes.

Author information

Rasmussen Maria H, Strandgaard Magnus, Seumer Julius, Hemmingsen Laura K, Frei Angelo, Balcells David, Jensen Jan H

Affiliations

Department of Chemistry, University of Copenhagen, Copenhagen, Denmark.

Department of Chemistry, University of York, York, UK.

Publication information

J Cheminform. 2025 Apr 28;17(1):63. doi: 10.1186/s13321-025-01008-1.

Abstract

We present a method for creating RDKit-parsable SMILES for transition metal complexes (TMCs) based on the xyz-coordinates and overall charge of the complex. This can be viewed as an extension to the program xyz2mol, which does the same for organic molecules. The only dependency is RDKit, which makes the method widely applicable. What has been lacking for structure-to-SMILES generation for TMCs is an existing SMILES dataset to compare with, so sanity-checking a method has required manual work. We therefore also generate SMILES in two other ways. In the first, ligand charges and TMC connectivity are based on natural bond orbital (NBO) analysis from density functional theory (DFT) calculations, utilizing recent work by Kneiding et al. (Digit Discov 2: 618-633, 2023). The second fixes the SMILES available through the Cambridge Structural Database (CSD), making them parsable by RDKit. We compare these three ways of obtaining SMILES for a subset of the CSD (tmQMg) and find >70% agreement for all three pairs. We use these SMILES to build simple molecular fingerprint (FP) and graph-based representations of the molecules for use in machine learning. Compared with the graphs made by Kneiding et al., where nodes and edges are featurized with DFT properties, we find that, depending on the target property (polarizability, HOMO-LUMO gap or dipole moment), the SMILES-based representations can perform equally well. This makes them very suitable as baseline models. Finally, we present a dataset of 227k RDKit-parsable SMILES for mononuclear TMCs in the CSD.

Scientific contribution

We present a method that can create RDKit-parsable SMILES strings of transition metal complexes (TMCs) from Cartesian coordinates and use it to create a dataset of 227k TMC SMILES strings. The RDKit-parsability allows us to perform machine learning studies of TMC properties using "standard" molecular representations such as fingerprints and 2D-graph convolution. We show that these relatively simple representations can perform quite well depending on the target property.
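
The abstract reports agreement between three ways of obtaining SMILES, compared pair by pair. The bookkeeping behind such a comparison might look like the minimal sketch below. It is not the authors' code. It assumes each method yields a mapping from a complex identifier to a SMILES string that has already been canonicalized (for example with RDKit), so plain string equality is a meaningful test. The function name `pairwise_agreement`, the method labels and the toy entries are hypothetical placeholders, not real CSD data or real SMILES.

```python
from itertools import combinations


def pairwise_agreement(smiles_sets):
    """Fraction of shared complexes with identical SMILES, per pair of methods.

    smiles_sets maps a method name to a dict {complex_id: canonical_smiles}.
    Only complexes present in both methods of a pair are counted.
    """
    results = {}
    for name_a, name_b in combinations(sorted(smiles_sets), 2):
        set_a, set_b = smiles_sets[name_a], smiles_sets[name_b]
        shared = set_a.keys() & set_b.keys()
        if not shared:
            # No overlap: agreement is undefined for this pair.
            results[(name_a, name_b)] = None
            continue
        matches = sum(1 for key in shared if set_a[key] == set_b[key])
        results[(name_a, name_b)] = matches / len(shared)
    return results


# Hypothetical toy data: the strings are placeholders, not real SMILES.
toy_sets = {
    "xyz": {"id1": "SMI_A", "id2": "SMI_B", "id3": "SMI_C", "id4": "SMI_D"},
    "nbo": {"id1": "SMI_A", "id2": "SMI_B", "id3": "SMI_X", "id4": "SMI_D"},
    "csd": {"id1": "SMI_A", "id2": "SMI_B", "id3": "SMI_C"},
}

agreement = pairwise_agreement(toy_sets)
for pair, fraction in agreement.items():
    print(pair, fraction)
```

Restricting each pair to the complexes both methods could process is one possible design choice. Whether the reported >70% figures were computed that way is not stated in the abstract.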

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0bee/12039060/cccaa29029e8/13321_2025_1008_Fig1_HTML.jpg
