
Structural Bias in Three-Dimensional Autoregressive Generative Machine Learning of Organic Molecules.

Authors

Koczor-Benda Zsuzsanna, Gilkes Joe, Bartucca Francesco, Al-Fekaiki Abdulla, Maurer Reinhard J

Affiliations

Department of Chemistry, University of Warwick, Coventry CV4 7AL, U.K.

Centre for Doctoral Training in Modelling of Heterogeneous Systems, University of Warwick, Coventry CV4 7AL, U.K.

Publication information

J Chem Inf Model. 2025 Jul 14;65(13):6644-6654. doi: 10.1021/acs.jcim.5c00665. Epub 2025 Jun 24.


PMID:40556385
Abstract

A range of generative machine learning models for the design of novel molecules and materials have been proposed in recent years. Models that can generate three-dimensional structures are particularly suitable for quantum chemistry workflows, enabling direct property prediction. The performance of generative models is typically assessed based on their ability to produce novel, valid, and unique molecules. However, equally important is their ability to learn the prevalence of functional groups and certain chemical moieties in the underlying training data, that is, to faithfully reproduce the chemical space spanned by the training data. Here, we investigate the ability of the autoregressive generative machine learning model G-SchNet to reproduce the chemical space and property distributions of training data sets composed of large, functional organic molecules. We assess the elemental composition, size- and bond-length distributions, as well as the functional group and chemical space distribution of training and generated molecules. By principal component analysis of the chemical space, we find that the model leads to a biased generation that is largely unaffected by the choice of hyperparameters or the training data set distribution, producing molecules that are, on average, less saturated and contain more heteroatoms. Purely aliphatic molecules are mostly absent in the generation. We further investigate generation with functional group constraints and based on composite data sets, which can help to partially remedy the model generation bias. Decision tree models can recognize the generation bias in the models and discriminate between training and generated data, revealing key chemical differences between the two sets. The chemical differences we find affect the distributions of electronic properties such as the HOMO-LUMO gap, which is a common target for functional molecule design.
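
The abstract notes that decision tree models can discriminate between training and generated molecules and thereby expose the chemical differences between the two sets. The sketch below illustrates that idea in its simplest form: a depth-one decision tree (a decision stump) that searches for the single descriptor threshold best separating the two sets by Gini impurity. It is not the authors' implementation. The descriptor names (`heteroatom_fraction`, `sp3_carbon_fraction`), the helper `best_stump`, and all numbers are hypothetical toy values chosen only to mirror the reported trend (generated molecules less saturated, with more heteroatoms).

```python
from typing import List, Sequence, Tuple


def gini(labels: Sequence[int]) -> float:
    """Gini impurity of a collection of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_stump(X: List[List[float]], y: List[int]) -> Tuple[int, float, float]:
    """Return (feature index, threshold, weighted Gini) of the best single split.

    Candidate thresholds are midpoints between consecutive distinct values
    of each feature; samples with value <= threshold go to the left branch.
    """
    n = len(y)
    best = (-1, 0.0, float("inf"))
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [y[i] for i in range(n) if X[i][j] <= t]
            right = [y[i] for i in range(n) if X[i][j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best[2]:
                best = (j, t, score)
    return best


# Hypothetical descriptors, one row per molecule (toy values, not paper data).
FEATURES = ["heteroatom_fraction", "sp3_carbon_fraction"]
training = [[0.10, 0.80], [0.15, 0.60], [0.20, 0.70], [0.05, 0.90]]
generated = [[0.30, 0.30], [0.35, 0.20], [0.25, 0.40], [0.40, 0.10]]

X = training + generated
y = [0] * len(training) + [1] * len(generated)  # 0 = training, 1 = generated

feature, threshold, impurity = best_stump(X, y)
print(f"split on {FEATURES[feature]} at {threshold:.3f}, weighted Gini {impurity:.3f}")
```

On real data one would compute descriptors with a cheminformatics toolkit and fit a deeper tree; the feature chosen at each split then indicates which chemical property most clearly distinguishes generated molecules from training molecules.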

