Liang Youheng, Huangfu Xiaoliu, Huang Ruixing, Han Zhenpeng, Wu Sisi, Wang Jingrui, Long Xinlong, Ma Jun, He Qiang
Key Laboratory of Eco-Environments in Three Gorges Reservoir Region, Ministry of Education, College of Environment and Ecology, Chongqing University, Chongqing 400044, China.
J Hazard Mater. 2024 Jul 5;472:134501. doi: 10.1016/j.jhazmat.2024.134501. Epub 2024 May 6.
Rapid advances in machine learning (ML) provide fast, accurate, and widely applicable methods for predicting the reactivity of organic pollutants in free radical-mediated reactions. In this study, the rate constants (logk) for four halogen radicals were predicted using Morgan fingerprints (MF) and Mordred descriptors (MD) in combination with a series of ML models. The results showed that accurate prediction across datasets depended on an effective pairing of descriptors and algorithms. To alleviate the limitation of small sample sizes, we introduced a data combination strategy that merged the individual datasets, which improved prediction accuracy and mitigated overfitting. The Light Gradient Boosting Machine (LightGBM) with MF and the Random Forest (RF) with MD, both trained on the unified dataset, were selected as the optimal models. SHapley Additive exPlanations (SHAP) analysis showed that the MF-LightGBM model captured the influence of electron-withdrawing and electron-donating groups, while autocorrelation, walk count, and information content descriptors were the key features of the MD-RF model. pH was also an important contributor to the predictions. The applicability domain analysis further supported that the developed models can make reliable predictions for query compounds across a broader range. Finally, a practical web application for logk calculation was built.
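The data combination strategy mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how the individual datasets were merged, so the layout below is an assumption: each row holds the fingerprint bits, the pH, and a one-hot indicator for the radical. The radical labels, the `combine_datasets` helper, and all numeric values are hypothetical placeholders, not names or data from the study.

```python
from typing import Dict, List, Tuple

# Hypothetical placeholder labels; the abstract does not name the four radicals.
RADICALS = ["radical_A", "radical_B", "radical_C", "radical_D"]

# One record: (fingerprint bits, pH, logk)
Record = Tuple[List[int], float, float]


def combine_datasets(
    per_radical: Dict[str, List[Record]],
) -> Tuple[List[List[float]], List[float]]:
    """Stack per-radical records into one feature matrix X and target vector y.

    Assumed row layout: fingerprint bits + [pH] + one-hot radical indicator.
    """
    X: List[List[float]] = []
    y: List[float] = []
    for radical in RADICALS:
        # One-hot vector marking which radical this block of rows belongs to.
        one_hot = [1.0 if radical == r else 0.0 for r in RADICALS]
        for bits, ph, logk in per_radical.get(radical, []):
            X.append([float(b) for b in bits] + [ph] + one_hot)
            y.append(logk)
    return X, y


# Toy values for illustration only; these are not data from the study.
toy = {
    "radical_A": [([1, 0, 1, 0], 7.0, 9.1), ([0, 1, 1, 0], 7.0, 8.4)],
    "radical_B": [([1, 0, 1, 0], 5.0, 7.2)],
    "radical_D": [([0, 0, 1, 1], 9.0, 6.5)],
}
X, y = combine_datasets(toy)
```

The resulting `X` and `y` could then be passed to a single regressor such as LightGBM or a random forest, so that one model learns from all radicals at once instead of from each small dataset separately.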