Zhu Minghua, Xiao Zijun, Zhang Tao, Lu Guanghua
Key Laboratory of Integrated Regulation and Resources Development of Shallow Lakes of Ministry of Education, Hohai University, Nanjing 210098, China; College of Environment, Hohai University, Nanjing 210098, China.
Key Laboratory of Industrial Ecology and Environmental Engineering (Ministry of Education), Dalian Key Laboratory on Chemicals Risk Control and Pollution Prevention Technology, School of Environmental Science and Technology, Dalian University of Technology, Dalian 116024, China.
J Hazard Mater. 2025 Jan 15;482:136606. doi: 10.1016/j.jhazmat.2024.136606. Epub 2024 Nov 20.
Accurate prediction of bioaccumulation parameters is essential for assessing the exposure, hazards, and risks of chemicals. However, most existing prediction models for bioaccumulation parameters are individual models built on a single algorithm and offer no model interpretation, so their accuracy is limited by the inherent constraints of that algorithm and their interpretability is weak. Ensemble learning (EL), which combines multiple algorithms, coupled with the SHapley Additive exPlanation (SHAP) method, may overcome these limitations. Herein, EL models were constructed for three bioaccumulation parameters using datasets covering 2496 chemicals. The EL models demonstrated higher prediction accuracy than both the individual models developed in this study and those from previous research, achieving a coefficient of determination (R²) of up to 0.861 on the validation sets. Applicability domains were characterized using a structure-activity landscape-based methodology (abbreviated as AD). The optimal EL models, together with the AD, were used to predict bioaccumulation parameters for 4374 chemicals included in the Inventory of Existing Chemical Substances of China. Model interpretation using the SHAP method offered insight into the key features influencing bioaccumulation potential, including hydrophobicity, water solubility, polarizability, ionization potential, molecular weight, and molecular volume. Overall, the study provides data and models to support the sound management and risk assessment of chemicals.
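The three ideas the abstract relies on (combining models into an ensemble, scoring with the coefficient of determination, and attributing a prediction to input features with Shapley values) can be illustrated with a minimal sketch. This is not the authors' pipeline: the two base models, their coefficients, the descriptor names `log_kow` and `mol_weight`, and the baseline compound are all hypothetical, and the Shapley values are computed exactly against a single baseline point rather than with the approximations the SHAP library uses.

```python
from itertools import combinations
from math import factorial


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Two hypothetical base models over the descriptors (log_kow, mol_weight).
# The coefficients are invented for illustration only.
def linear_model(x):
    log_kow, mol_weight = x
    return 0.8 * log_kow - 0.001 * mol_weight


def interaction_model(x):
    log_kow, mol_weight = x
    return 0.6 * log_kow - 0.0005 * log_kow * mol_weight


def ensemble(x):
    """Simplest possible ensemble: the unweighted mean of the base models."""
    return 0.5 * (linear_model(x) + interaction_model(x))


def shapley_values(model, x, baseline):
    """Exact Shapley attribution of model(x) - model(baseline).

    A feature that is 'absent' from a coalition takes its baseline value.
    The cost grows exponentially with the number of features, so this is
    only practical for a handful of descriptors.
    """
    n = len(x)

    def value(subset):
        mixed = [x[i] if i in subset else baseline[i] for i in range(n)]
        return model(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for k in range(n):  # k = size of the coalition without feature i
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


x = [4.0, 300.0]          # hypothetical query compound
baseline = [2.0, 200.0]   # hypothetical reference compound
phi = shapley_values(ensemble, x, baseline)
print("attributions (log_kow, mol_weight):", phi)
print("sum of attributions:", sum(phi))
print("prediction shift:", ensemble(x) - ensemble(baseline))
```

The attributions sum to the difference between the ensemble prediction for the query compound and for the baseline, which is the additivity property that gives SHAP its name. A practical workflow would replace the toy models with trained learners and estimate the attributions over a background dataset.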