

Predicting HOMO-LUMO Gaps Using Hartree-Fock Calculated Data and Machine Learning Models.

Author information

Hasan Md Mehedi, Tarkhaneh Omid, Bungay Sharene D, Poirier Raymond A, Islam Shahidul M

Affiliations

Department of Chemistry, Delaware State University, Dover, Delaware 19901, United States.

Department of Computer Science, Memorial University of Newfoundland, St. John's, Newfoundland and Labrador A1B 3X5, Canada.

Publication information

J Chem Inf Model. 2025 Sep 22;65(18):9497-9515. doi: 10.1021/acs.jcim.5c01412. Epub 2025 Sep 10.

Abstract

Calculating the highest occupied molecular orbital-lowest unoccupied molecular orbital (HOMO-LUMO) gap of a molecule with quantum mechanics (QM) methods is computationally intensive, while experimental determination is often costly and time-consuming. Machine learning (ML) offers a cost-effective and rapid alternative, enabling efficient prediction of HOMO-LUMO gap values across large data sets without extensive QM computations or experiments. ML models facilitate the screening of diverse molecules, providing insight into complex chemical spaces and integrating into high-throughput workflows to prioritize candidates for experimental validation. In this study, we used a data set of HOMO-LUMO gap values for small molecules obtained through Hartree-Fock (HF) calculations to develop ML models that predict HOMO-LUMO energy gaps for organic molecules. Molecular descriptors generated with RDKit from Simplified Molecular Input Line Entry System (SMILES) representations served as input features for training various regression-based ML models. The data set included 46,717 small molecules with carbon numbers ranging from 1 to 8. Among the tested models, the LightGBM regressor, Bidirectional LSTM, CatBoost regressor, and Multilayer Perceptron (MLP) achieved mean absolute error (MAE) values below 0.25 eV. A weighted ensemble combining the LightGBM regressor, Bidirectional LSTM, and MLP improved on this, reaching an MAE of 0.1660 eV. The ensemble model outperformed the others across various data sets, although the LightGBM regressor performed better at predicting the HOMO-LUMO gap of saturated linear molecules. SHAP analysis identified 20 molecular descriptors critical for accurate predictions. Additionally, the models were empirically adapted to estimate experimental HOMO-LUMO gap values for both small and large molecules (up to carbon number 50), demonstrating their versatility and practical applicability.
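The weighted ensemble and the MAE metric described in the abstract can be illustrated with a minimal sketch. The abstract does not report the ensemble weights or any per-molecule predictions, so the weights, the gap values, and the function names below are hypothetical and chosen only to show the arithmetic: each model's prediction is scaled by its weight, the scaled values are summed and normalized, and the MAE is the average absolute deviation from the reference gap.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |reference - prediction| over all molecules."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_ensemble(predictions, weights):
    """Combine per-model predictions using weights normalized to sum to 1."""
    if set(predictions) != set(weights):
        raise ValueError("predictions and weights must name the same models")
    total = sum(weights.values())
    n_samples = len(next(iter(predictions.values())))
    return [
        sum(weights[name] * predictions[name][i] for name in predictions) / total
        for i in range(n_samples)
    ]


# Hypothetical reference HF gaps (eV) for four molecules; not from the paper.
y_true = [12.0, 10.5, 14.2, 9.8]

# Hypothetical per-model predictions (eV) for the same four molecules.
predictions = {
    "lightgbm": [12.2, 10.4, 14.0, 10.0],
    "bilstm": [11.7, 10.8, 14.5, 9.6],
    "mlp": [12.1, 10.3, 14.3, 9.9],
}

# Assumed weights; the abstract does not state the values actually used.
weights = {"lightgbm": 0.5, "bilstm": 0.2, "mlp": 0.3}

ensemble_pred = weighted_ensemble(predictions, weights)
ensemble_mae = mean_absolute_error(y_true, ensemble_pred)
single_maes = {
    name: mean_absolute_error(y_true, pred) for name, pred in predictions.items()
}

print(f"ensemble MAE: {ensemble_mae:.4f} eV")
for name, mae in single_maes.items():
    print(f"{name} MAE: {mae:.4f} eV")
```

In this toy data the models err in different directions, so averaging cancels part of the error and the ensemble MAE falls below each single-model MAE. That cancellation is the usual motivation for ensembling, though it is not guaranteed for arbitrary weights or data.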

