Jung Son Gyo, Jung Guwon, Cole Jacqueline M
Cavendish Laboratory, Department of Physics, University of Cambridge, J. J. Thomson Avenue, Cambridge CB3 0HE, U.K.
ISIS Neutron and Muon Source, STFC Rutherford Appleton Laboratory, Harwell Science and Innovation Campus, Didcot, Oxfordshire OX11 0QX, U.K.
J Chem Inf Model. 2024 Mar 11;64(5):1486-1501. doi: 10.1021/acs.jcim.3c01792. Epub 2024 Feb 29.
Molecular design depends heavily on optical properties for applications such as solar cells and polymer-based batteries. Accurate prediction of these properties is essential, and multiple predictive methods exist, from ab initio to data-driven techniques. Although theoretical methods, such as time-dependent density functional theory (TD-DFT) calculations, have well-established physical relevance and are among the most popular methods in computational physics and chemistry, they exhibit errors that are inherent in their approximate nature. These high-throughput electronic structure calculations also incur a substantial computational cost. With the emergence of big-data initiatives, cost-effective, data-driven methods have gained traction, although their usability is highly contingent on data quality and sparsity. In this study, we present a workflow that employs deep residual convolutional neural networks (DR-CNN) and gradient boosting feature selection to predict peak optical absorption wavelengths (λ) exclusively from SMILES representations of dye molecules and solvents; one would normally measure λ using UV-vis absorption spectroscopy. We use a multifidelity modeling approach, integrating 34,893 DFT calculations and 26,395 experimentally derived λ values, to deliver more accurate predictions via a Bayesian-optimized gradient boosting machine. Our approach is benchmarked against the state of the art reported in the scientific literature; the results demonstrate that representations learnt via a DR-CNN workflow, integrated with other machine learning methods, can accelerate the design of molecules for specific optical characteristics.
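To make the idea of predicting λ "exclusively from SMILES representations" more concrete, the following is a minimal sketch of how a dye and solvent SMILES pair could be turned into fixed-size numeric input for a convolutional network. The abstract does not state how the SMILES strings are encoded, so the character-level one-hot scheme, the function names `build_vocab` and `one_hot`, the padding length of 32, and the example molecules are all illustrative assumptions rather than the authors' method.

```python
# Hypothetical sketch: character-level one-hot encoding of SMILES strings.
# The vocabulary, padding length, and example molecules are assumptions for
# illustration; they are not taken from the paper.

def build_vocab(smiles_list):
    """Map each character seen in the SMILES strings to an index (0 is padding)."""
    chars = sorted({ch for s in smiles_list for ch in s})
    return {ch: i + 1 for i, ch in enumerate(chars)}


def one_hot(smiles, vocab, max_len):
    """Return a max_len x (len(vocab) + 1) matrix of 0/1 integers.

    Rows beyond the end of the string are marked as padding (column 0).
    """
    if len(smiles) > max_len:
        raise ValueError("SMILES string is longer than max_len")
    width = len(vocab) + 1
    matrix = []
    for pos in range(max_len):
        row = [0] * width
        idx = vocab[smiles[pos]] if pos < len(smiles) else 0
        row[idx] = 1
        matrix.append(row)
    return matrix


dye = "c1ccccc1N=Nc1ccccc1"  # azobenzene, used here only as an example dye
solvent = "CCO"              # ethanol, used here only as an example solvent

vocab = build_vocab([dye, solvent])
dye_matrix = one_hot(dye, vocab, max_len=32)
solvent_matrix = one_hot(solvent, vocab, max_len=32)
```

Each matrix has one row per character position and one column per vocabulary symbol, which is the kind of grid a one-dimensional convolution can scan. A character-level split is the simplest choice but treats two-letter atoms such as Cl or Br as two separate tokens; a production tokenizer would usually handle these as single symbols.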