Koczor-Benda Zsuzsanna, Gilkes Joe, Bartucca Francesco, Al-Fekaiki Abdulla, Maurer Reinhard J
Department of Chemistry, University of Warwick, Coventry CV4 7AL, U.K.
Centre for Doctoral Training in Modelling of Heterogeneous Systems, University of Warwick, Coventry CV4 7AL, U.K.
J Chem Inf Model. 2025 Jul 14;65(13):6644-6654. doi: 10.1021/acs.jcim.5c00665. Epub 2025 Jun 24.
A range of generative machine learning models for the design of novel molecules and materials have been proposed in recent years. Models that can generate three-dimensional structures are particularly suitable for quantum chemistry workflows, enabling direct property prediction. The performance of generative models is typically assessed based on their ability to produce novel, valid, and unique molecules. However, equally important is their ability to learn the prevalence of functional groups and certain chemical moieties in the underlying training data, that is, to faithfully reproduce the chemical space spanned by the training data. Here, we investigate the ability of the autoregressive generative machine learning model G-SchNet to reproduce the chemical space and property distributions of training data sets composed of large, functional organic molecules. We assess the elemental composition, size- and bond-length distributions, as well as the functional group and chemical space distribution of training and generated molecules. By principal component analysis of the chemical space, we find that the model leads to a biased generation that is largely unaffected by the choice of hyperparameters or the training data set distribution, producing molecules that are, on average, less saturated and contain more heteroatoms. Purely aliphatic molecules are mostly absent in the generation. We further investigate generation with functional group constraints and based on composite data sets, which can help to partially remedy the model generation bias. Decision tree models can recognize the generation bias in the models and discriminate between training and generated data, revealing key chemical differences between the two sets. The chemical differences we find affect the distributions of electronic properties such as the HOMO-LUMO gap, which is a common target for functional molecule design.
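The discrimination step described above can be illustrated with a minimal sketch. It assumes two simple composition-based descriptors (degrees of unsaturation per heavy atom and heteroatom fraction) and a depth-1 decision tree (a "stump"). The descriptor choice, the function names, and the eight toy molecules are illustrative assumptions, not the descriptors, models, or data used in the study.

```python
# Hypothetical toy example: a depth-1 decision tree separating "training"
# from "generated" molecules using two composition-based descriptors.
# Molecules and labels are invented for illustration only.

HALOGENS = ("F", "Cl", "Br", "I")


def degrees_of_unsaturation(formula):
    """Rings plus pi bonds, from an element -> count mapping.

    Standard formula: (2C + 2 + N - H - X) / 2; divalent atoms such as
    O and S do not contribute.
    """
    c = formula.get("C", 0)
    h = formula.get("H", 0)
    n = formula.get("N", 0)
    x = sum(formula.get(hal, 0) for hal in HALOGENS)
    return (2 * c + 2 + n - h - x) / 2


def descriptors(formula):
    """Return (unsaturation per heavy atom, heteroatom fraction)."""
    heavy = sum(v for k, v in formula.items() if k != "H")
    hetero = sum(v for k, v in formula.items() if k not in ("H", "C"))
    return (degrees_of_unsaturation(formula) / heavy, hetero / heavy)


def best_stump(X, y):
    """Exhaustively search one feature and one threshold.

    Returns (feature index, threshold, accuracy). Either orientation of
    the split is allowed, so accuracy is max(acc, 1 - acc).
    """
    best = (None, None, 0.0)
    for j in range(len(X[0])):
        values = sorted({row[j] for row in X})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            pred = [1 if row[j] > t else 0 for row in X]
            acc = sum(p == label for p, label in zip(pred, y)) / len(y)
            acc = max(acc, 1 - acc)
            if acc > best[2]:
                best = (j, t, acc)
    return best


# Label 0: stand-ins for training molecules (more saturated, fewer heteroatoms).
training = [
    {"C": 6, "H": 14},          # hexane
    {"C": 6, "H": 12},          # cyclohexane
    {"C": 2, "H": 6, "O": 1},   # ethanol
    {"C": 7, "H": 8},           # toluene
]
# Label 1: stand-ins for generated molecules (less saturated, more heteroatoms).
generated = [
    {"C": 5, "H": 5, "N": 1},   # pyridine
    {"C": 4, "H": 4, "O": 1},   # furan
    {"C": 4, "H": 4, "N": 2},   # pyrimidine
    {"C": 7, "H": 5, "N": 1},   # benzonitrile
]

X = [descriptors(m) for m in training + generated]
y = [0] * len(training) + [1] * len(generated)

feature, threshold, accuracy = best_stump(X, y)
names = ("unsaturation per heavy atom", "heteroatom fraction")
print(f"split on {names[feature]} at {threshold:.3f}, accuracy {accuracy:.2f}")
```

The useful property of such a model is that the chosen feature and threshold are directly readable, so the split itself names the chemical difference between the two sets. A real analysis would need far richer descriptors (for example functional-group counts) and a held-out test set.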