Haciefendioglu Tugba, Yildirim Erol
Department of Chemistry, Middle East Technical University, 06800 Ankara, Turkey.
Department of Polymer Science and Technology, Middle East Technical University, 06800 Ankara, Turkey.
J Chem Inf Model. 2025 Jun 9;65(11):5360-5369. doi: 10.1021/acs.jcim.5c00345. Epub 2025 May 28.
The performance and reliability of machine learning (ML)-based quantitative structure-property relationship (QSPR) models depend on the quality, size, and diversity of the data set used for model training. In this study, we manually curated a large-scale data set containing 3120 donor-acceptor (D-A) conjugated polymers (CPs) by selecting the 60 most widely used donors and 52 most widely used acceptors. This data set serves as a resource for ML-based prediction of key electronic properties, such as band gap energy (Eg) and hole reorganization energy (λ), which were calculated using density functional theory (DFT) to advance organic photovoltaics (OPV). Beyond data set construction, we systematically investigated how different descriptor and fingerprint types affect ML model performance. Recognizing that not all features contribute equally to model performance, we conducted an in-depth analysis to identify the most informative descriptors for these fundamental optoelectronic properties. Our findings show that kernel partial least-squares (KPLS) regression using radial and molprint2D fingerprints achieved the highest accuracy in predicting Eg, with R² values of 0.899 and 0.897, respectively. For λ prediction, models integrating electronic descriptors such as frontier orbital energy levels significantly improved performance, achieving an R² value of 0.830. This study provides a comprehensive investigation of how different descriptors influence model performance in OPV research. By analyzing why certain models succeed while others fail, our findings offer insight into feature selection and data set optimization for accurate target property prediction in organic electronics. The developed ML models provide a predictive framework for the design of high-performance OPV materials, significantly reducing the reliance on labor-intensive experimental procedures and computationally expensive first-principles calculations.
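As a rough illustration of the data set size and the accuracy metric, the sketch below pairs every donor with every acceptor (60 × 52 = 3120, which matches the reported polymer count) and defines a coefficient of determination. The labels `D01`, `A01`, and so on are hypothetical placeholders, not the monomers used in the study, and the sketch assumes that the reported accuracy values are R² scores; it is not the authors' workflow.

```python
from itertools import product

# Hypothetical placeholder labels; the study's actual donor and
# acceptor units are not listed in this record.
donors = [f"D{i:02d}" for i in range(1, 61)]      # 60 donors
acceptors = [f"A{j:02d}" for j in range(1, 53)]   # 52 acceptors

# One donor-acceptor repeat unit per (donor, acceptor) pair.
pairs = list(product(donors, acceptors))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    print(len(pairs))
```

Under this definition, a model that always predicts the mean of the reference values scores 0 and a perfect model scores 1, which gives a scale for reading values such as 0.899 or 0.830.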